Inferring DNA sequences from mechanical unzipping data: the large-bandwidth case. 
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The complementary strands of DNA molecules can be separated when stretched apart by a force; 
the unzipping signal is correlated to the base content of the sequence but is affected by thermal 
and instrumental noise. We consider here the ideal case where opening events are known to a very 
. . . good time resolution (very large bandwidth), and study how the sequence can be reconstructed from 

the unzipping data. Our approach relies on the use of statistical Bayesian inference and of Viterbi 
■ decoding algorithm. Performances are studied numerically on Monte Carlo generated data, and 

' analytically. We show how multiple unzippings of the same molecule may be exploited to improve 

the quality of the prediction, and calculate analytically the number of required unzippings as a 
i— ( . function of the bandwidth, the sequence content, the elasticity parameters of the unzipped strands. 
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I. INTRODUCTION 



As DNA molecules are the support for the genetic information, the knowledge of their sequence content is very 
important both from the biological and medical points of view. Over the last decade the sequencing of various 

• genomes, in particular the human one, was done at the price of intense efforts. A traditional strategy for reading a 
PQ DNA molecule is based on the so-called Sanger method 0, Q . The DNA molecule is divided into fragments (with 

• N ^ 100 — 1000 base pairs); each fragment is amplified through PGR. The copies of each fragment are denaturated, 
. ^ ' and double-stranded DNA subfragments are synthesized under the action of DNA polymerases. The key point is 
, that each of the four nucleotides A, T, C, G is present in solution under its normal form at high concentration and 
qh' under a modified form, tagged with a base-specific fluorescent label and inadequate for further polymerization, at 
low concentration. At the end of the polymerization step many copies of each fragment are obtained. The copies 
of a fragment have a common extremity and have various lengths L, with a base-specific fluorescent base B at the 
end. The entire population of copies is sorted by length using gel electrophoresis and the sequence of the fragment is 
reconstructed from the list of terminal bases B{L), 1 < L < N. The method correctly predicts 99.9% of the bases of 
^-j- ' a fragment, but additional errors may arise during the reconstruction of the whole sequence from its fragments. 

Despite the success of conventional sequencing the quest for alternative (faster or cheaper) methods is an active field 
Cn| ' of research. Recently various single molecule experiments were carried out, allowing a direct investigation of DNA 
; mechanics and protein-DNA interaction [1, i, i, i, 0, i, i, M, M, IH, IH, Q M, M, E H El, US HHM HI ■ These 

• experiments provide dynamical information usually hidden in large scale bulk experiments, such as intermediate 
[ — , metastable states or fluctuations at the scale of the individual molecule. Remarkably, these dynamical effects are 

■ largely sequence-dependent in various experimental situations e.g. the opening of the double helix under a mechanical 
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UK 



, [T3, [Tl|, [H, [in J the digestion of a DNA molecule by an exonuclease [H, [l3] , DNA polymerization 
translocation through nanopores [22|, [23j . Understanding how much information about the sequence is 



■ contained in the measured signals is important. 



Hereafter, we focus on mechanical unzipping experiments, first introduced by Bockelmann and Heslot in 1997 [81]. The 
complementary strands are pulled apart at a constant velocity while the force necessary to the opening is measured. 
The average opening force for the A phage is of about 15 pN, with fluctuations around this value that depend on the 
particular sequence content. In a more recent experiment, Bockelmann, Heslot and collaborators have shown that the 
force signal is correlated to the average sequence on the scale of ten base pairs but could be affected by the mutation 
of one base pair a deq uately located along the sequence [Toj . 

Liphart et al. [13| and Danilowicz et al. |14| have performed an analogous experiment, using a constant force 
setup, on a short RNA and a long DNA respectively. As sketched in Fig[Tl the distance between the two strands 
extremities is measured as a function of the time while the molecule is submitted to a constant force. The dynamics 
is characterized by rapid zipping or unzipping jumps followed by long pauses where the unzipped length remains 
constant. Several repetitions have shown that positions and duration of these plateaus are largely reproducible, thus 
providing a 'fingerprint' of the sequence. The theoretical description of the DNA mechanical unzipping, at constant 
velocity and constant force, has been extensively developed [1, [l2, [13, HE HI, US, HE HI, HI, HH HI] ■ Models have been 
able to reproduce the force (for constant velocity experiments) or position (for constant force experiments) signals 
given the DNA sequence. It is a natural question to ask whether one could, inversely, get information about the 
sequence from experimental data [ssj . 
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FIG. 1: Sketch of a fixed-force unzipping experiment: the adjacent 5' and 3' extremities of a DNA molecule are submitted to a 
constant force /. The distance between the extremities, x, is measured as a function of time, x is proportional to the number 
n of open base pairs (bp) up to some fiuctuations due to the floppiness of the unzipped strands. The number n of open bp 
increases or decreases by one with rates Vo and Vc respectively, see dynamical model in Section [II Al 



This question was addressed by us in a recent letter [3J]. It was found that the error in the prediction e.g. the 
probability that a base is erroneously predicted decreases exponentially with the amount of available data. The decay 
rate was shown to depend on the sequence content, the applied force, the time and space resolution, ... The goal of 
the present paper is to provide a complete presentation of the numerical and analytical work supporting the results of 
[s^ l in the idealized case of perfect time and space resolutions. Though this case is not realistic from an experimental 
point of view, it can be studied in great detail. We show that the most important result, the exponential decay of the 
probability of misprediction with the amount of collected data, holds in more realistic situation where the bandwidth 
and the fluctuations in the extension of the DNA strands are taken into account. Our analysis focuses on the fixed 
force device data which is somewhat simpler from a theoretical point of view. 

In Section II we first introduce the dynamical model that, given a sequence, determines the unzipping signal. The 
inverse problem is then introduced and treated within the Bayesian inference framework. Section III reports the 
numerical results for the quality of prediction from numerical data obtained from the Monte Carlo simulation of the 
unzipping of a A-phage DNA. The analytical study of inference performances is presented in section IV. While the 
above study assumed the existence of infinite temporal and spatial resolution over the fork location the effects of 
realistic limitations are studied in Section V. A summary and discussion of the results is presented in Section VI. 



II. BAYESIAN INFERENCE FRAMEWORK 



The direct problem of fixed-force DNA unzipping is to determine, given the sequence of the molecule, the distribution 
of the stochastic measured signal, that is, the extension between the two strands extremities as a function of time. 
The direct problem is considered in Section fll Al and results are used in Section fllBI to address the inverse problem, 
that is, the prediction of the sequence given a measured extension signal. 

Throughout this section we consider that the experimental signal gives access to the number of open bases itself 
rather than the distance between the extremities of the unzipped strands. This is merely an approximation since, due 
to the fluctuations in the extension of strands, the number of open bases is not in one-to-one correspondence with the 
distance between the strands. Corrections to this simplifying assumption will be discussed in Section [VBI 



A. From sequence to signal: the direct problem 

In a previous work we have developed a theoretical description of the dynamics of DNA and simple RNA molecules 
under a constant unzipping force [28| . Despite its simplicity this model is capable of reproducing the unzipping data 
for a given sequence |13l . Il4| and the rezipping dynamics of a partially unzipped DNA . 

Let hi — A,T,C, or G denote the i*'' base along the 5' — > 3' strand (the other strand is complementary), and 
B = 62, • • • , b]y}. The free energy excess when the first n bp of the molecule are open with respect to the closed 
configuration {n = 0) is 
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FIG. 2: Free energy G (units of ksT) to open the first n base pairs, for the first 50 bases of the DNA A-phage at forces 15.9 
(dashed curve) and 16.4 pN (full curve). For / = 15.9 pN the two minima at bp 1 and bp 50 are separated by a barrier of 12 
kflT. Inset: additional barrier representing the dynamical rates Q to go from base 10 to 9 (barrier equal to (;s=2.5 ksT), and 
from base 9 to 10 (barrier equal to (;o(&9, 6io)=3 k_gT), see text. 
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FIG. 3: Number of open base pairs as a function of the time for various forces (shown on Figure). Data show one numerical 
unzipping (for each force) obtained from a Monte Carlo simulation of the random walk motion of the fork with rates 



and involves two contributions. The first free energy, called go{bi, is the binding energy of base pair (bp) number 
i; it depends on bi (pairing interactions) and on the neighboring bp bi^i due to stacking interactions, go is obtained 
from the MFOLD server [H, [s^, and listed in Table [J The second contribution, called gs{f) is the vi^ork to stretch 
the two opened single strands when one more bp is opened. The elasticity of DNA strands is described by a modified 
freely jointed chain with a Kuhn length Iq = 15 A and an effective nucleotide length i — 5.6 A 7]. The corresponding 
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TABLE L Binding free energies go{bi,bi+i) (units of k^T) obtained from the MFOLD server [sB, [sll for DNA at room 
temperature, pH=7.5, and ionic concentration of 0.15 M. The base values bi,bi+i are given by the line and column respectively. 
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FIG. 4: Fork position n as a function of time t = i x At with i integer-valued; the sojourn times on each base are given. We 
call ti the total time spent on base i, and Ui, di the numbers ofj^i+l,i^i— 1 transitions respectively. Assuming the fork 
does not come back to n = 1 or 2 at later times, we have: ti/At — 9, ui = 2, di = 0, and t-ij At = 5, M2 = 1, di = 1. 



free energy for forces up to 20 pN is 

g,(/) = 2/nn[sinh(z)/z]/z with z = fio/ikBT). (2) 

As an illustration the free energy G{n, /; A) of the first 50 bases of the A phage sequence, A — (Ai , A2 , . . . , A/v^) , is 
plotted in Fig [2] for forces / = 15.9 and 16.4 pN. At these forces the two global minima are located in n = 1 (closed 
state) and n = 50 (partially open state). Experiments on a small RNA molecule, called P5ab, [31 have shown that, at 
the critical force fc such that the closed state has the same free energy than the open one: G(0, fc, B) — G{N, fc, B), 
the barrier between these two minima is not too high, the molecule then switches between these two states. For long 
molecule e.g. A-DNA the barrier between the closed and open states mya become very large e.g. ~ 3000 k^T for 
the A-DNA at the critical force fc = 15.5 pN ^2§\. The time it takes to cross this barrier is huge and full opening of 
the molecule never happens during experiments (unless the force is chosen to be much larger than its critical, infinite 
time value). The experimental opening signal is characterized by pauses at local minima of the free energy G(n, /; A) 
and rapid jumps between them |14|. This dynamical behavior is reproduced (Fig[3|) when one considers that the fork 
separating the closed from the open regions along the molecule undergoes a random walk motion in the free energy 
landscape G(n, /; A) [2^. The fork, located at position n, can move forward {n ^ n + 1) or backward {n ^ n — 1) 
with rates (probability per unit of time) equal to, respectively, 

ro(6„,6„+i) = r exp [go{bn,bn+i)] , Tc = r exp [gsif)] (3) 

see Fig[TJ The value of the attempt frequency r is of the order of 10^ Hz [Tl!, [13] ■ Notice that the free-energies 
are measured in units of k^T. 

The expression ^ for the rates is derived from the following assumptions. First the rates should satisfy detailed 
balance. Secondly we impose that the opening rate depends on the binding free energy, and not on the force, and 
vice- versa for the closing rate rc- This choice is motivated by the fact that the range for base pairs interaction is very 
small: the hydrogen and stacking bonds are broken when the bases are kept apart at a fraction of A , while the force 
work is appreciable on the distance of the opened bases (« 1 nm). On the contrary, to close the base pairs, one has 
first to work against the applied force, therefore the closing rate Vc depends on the force but not on the sequence. This 
physical origin of the rates is reported in the the inset of Fig[2l Notice that, as room temperature is much smaller 
than the thermal denturation temperature, we safely discard the existence of denatured bubble in the zipped DNA 
portion. 

B. From signal to sequence: the inverse problem. 

We consider here the ideal case where the experimental setup is not affected by any instrumental noise: data are 
acquired with a infinite temporal resolution, and, in addition, the unzipped strands do not fluctuate in length. The 
latter assumption will be lifted in Section IV Bl while the case of a large but not infinite bandwidth will be studied in 
Section |VX1 

In the absence of DNA strands fluctuations the distance between extremities is exactly proportional to the number 
n of unzipped bases. The measured signal is thus the time trace T = (ig, zi, Z2, ■ ■ • , *m) where im is the position of 
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the fork at time m x Ai, and texp = M /S.t is the duration of the experiment. The infinite bandwidth assumption 
amounts to postulate that the delay Ai between two measures is smaller than the sojourn time on a base. Therefore 
successive positions im, hn+i differ by ±1 at most. A typical result of this idealized experimental situation is sketched 
in Fig. m The signal is stochastic due to the thermal motion of the fork in the landscape of Fig [5J two repetitions of 
the experiment do not yield the same time-traces. The probability of a time-trace T, given the sequence B, reads 

M-i I Ai ro(6i„^,6i„^^J if = i,„ + 1 

V{T\B) =11 \ ^t^c if ^.n+l = *™ - 1 • (4) 

This probability can be conveniently rewritten through the introduction of the numbers Ui and rfj of, respectively, up 
(im = i im+i = « -l- 1) and down (im — i im+i = * ^ 1) transitions from base i, as well as the total time ti spent 
on base i (number of sojourn events — i im+i = i, multiplied by At) in the time-trace T, 

r{T\B) ^l[[At r,{b,,b,+,)y" [Atr,]* [l - At (r„(6„ -f r,)] = C(T) x [] M (6„ u„ i,) (5) 

i i 

where 

M{b,,h+i;U,u,) = exp [gQ{hM+i) - r e»»(''-^'+i) U] (6) 

and C(T) = At"+'^ rf exp(— rc texp), u = Ui, d = di, and we have used the fact that At is small with respect to 
the average sojourn time on a base, (tq +rc)~^. Up to the multiplicative factor C(T) (which does not depend on the 
sequence B), the probability V{T\B) is equal to the product of terms M expressing the interactions between adjacent 
bases dH). 

The probability that the DNA sequence is B given the observed time-trace T is, in the Bayesian inference framework 

sa. 

The value B*(T) of the sequence maximizing this probability, for a given time-trace T, is our prediction for the 
sequence. In the absence of any knowledge over the sequence B the a priori distribution over the sequences, Vq, is 
uniform and equal to 4^^. A straightforward albeit important consequence of ([7]) is that B*{T) can be found from 
the maximization of V{T\B) ([5]). We will briefly see in Section [ill Bl an alternative way of predicting sequences from 
the probabihty ([7]). 

In practice B*{T) can be exactly found in a time growing linearly with N only with the Viterbi algorithm [stI. [ssj. 
The principle of the algorithm is equivalent to a zero temperature transfer matrix technique. We start from the first 
base and choose the optimal value of this base for each possible value of the second one; in this way we assign a 
probability P2 to each value 62 of the second base through P2(&2) = max^^ Af (&i, 62; ^i, Then we optimize on the 
second base, and obtain P3(h^) = max^j Af (&2, 63; t2, U2) P2(&2), and so on, 

Pt+i{bi+i) = max M{bi, bi+i;U, Ui) Pi{bi) (8) 
bi 

until we reach the last base N of the sequence. At each step, the maximum of ([S]) is reached for some base b'^°^'^^ {bi+i) 
that depends on the choice of the next base Once the value &^ that optimize PnibN) has been calculated, one 
obtains the whole optimal sequence using the recursive relation = &™'5^(6*) until the first base of the chain. 

A direct application of the procedure may produce substantial numerical errors due to the product of a large number 
of terms. It turns out convenient to introduce the logarithms of the probabilities, ni{bi) = — liiPi{bi), and solve the 
recurrence relation 

n,+i{b^+i) = min[TT,{b,)-go{b,,b,+i)u, + rea"^'''''''+'Hi\ , (9) 

obtained from ([5]). 

If more than one unzippings are performed on the same molecule, several time-traces ri,T2, ■■■,Tji are available. 
As all unzippings are independent of each other we have 

R 

V{T,,T2,...,Tn\B) = l[v{Tp\B) (10) 

p=i 

where the distribution of a single time-trace is given by ([5]). It is immediate to check that equations (jS]) and ^ are 
still valid provided Ui and ti are, respectively, the total number of transitions i i + 1 and the total time spent on 
base i. Total means that these numbers have to be computed from the all R time-traces taken together. 
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C. Estimators of performances 



As in the previous Section, we consider a time-trace T, and call B*{T) the sequence with maximal probability 
given those data. The true sequence is denoted by B^] in most applications — A, the phage sequence but we will 
consider other e.g. repeated sequences. We focus on the indicators 

I 1 if base i is correctly predicted ie. b*(T) = bj' 
I otherwise 

As the time-trace T is stochastic, so are the Vi{T)s. Our numerical and theoretical analysis aim at calculating some 
statistical properties of these indicators. For instance the probability that base i is not correctly predicted is given by 

= l-(«,(r)) , (12) 

where the average value (.) is taken over the probability of time-traces given the true sequence i?^. The 

two-points connected correlation function, 

X^,J = {v.{T) v,{T)) - {v,{T)) {v,{T)) , (13) 

tells us how much a correct prediction on base i influences the quality of prediction on base j. From this local 
quantities we define the global error and correlation functions through, respectively, 

N , N-d 



i=l i=l 

Note that the zero-distance correlation function is simply xo = e(l ^ e) in the limit of large sequences. 



III. NUMERICAL ANALYSIS 



A. Maximum probability prediction 

To test this inference method we have generated ideal opening data from the sequence A of the A~phage with a 
Monte Carlo procedure. Once a time-trace T has been produced a second program ignoring the phage sequence and 
based on the Viterbi algorithm allows us to make a prediction on the sequence, B* (T) . 



1. Generation of numerical time-traces 

The unzipping signal T is obtained through a Monte Carlo (MC) simulation with opening and closing rate defined 
by the model ([3]). To save time, at each MC step, the fork moves by one base pair, either forward or backwards, 
without remaining on the same base. Prior to the move the sojourn time t on the base where the fork is, say, i, is 
randomly chosen according to an exponential distribution with characteristic time r = l/(ro(i) + Tc). Then, the fork 
moves backward (i — s- i — 1) with probability q = TcT, and forward (i ^ i -I- 1) with probability I ~ q. 

The total number of open base pairs increases with the duration of the opening experiments i.e. with the number 
of MC steps as shown in Fig[5l The higher the force the more tilted the free energy landscape, and the larger is the 
number of open bases. With 10^ MC steps we typically open 290 bp at 15.9 pN, 450 bp at 16.4 pN, and 4700 bp at 
17.4pN; each numerical unzipping lasts for ~ 15 sec. 

The temporal resolution is introduced by filtering the output dynamics with a time step At. Fork positions are 
registered at times ti = i x At. Each time-trace is then preprocessed to obtain the numbers Ui oi i i -\- 1 transitions 
and the set of times ti spent on each base i. The set of data {u.;, ti} is then passed to the Viterbi procedure. 



2. Results for global estimators 

We show in Fig [S] the average fraction of mispredicted bases, e as a function of the force. For each time-trace 
we calculate the fraction of the opened bases that were incorrectly predicted, and then average over MC time-traces 
(samples) . e increases with the force because the number of predicted (open) base pairs (Fig increases, and the 
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FIG. 5; Number of open bases as a function of applied force, and for 5 x 10®, lO'^, 10^ Monte Carlo steps. Data are averaged 
over 100 samples. The durations of the unzippings are, respectively, of 7, 15, and 140 seconds. The DNA A-phage includes 
48,502 bp. In inset we report the theoretical estimate of the number of open base pairs, for lO'^ and 10* MC steps, of Section 
IIV D II 
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FIG. 6: Fraction e p4p of mispredicted bases as a function of the force for the A-phage sequence. Data are averaged over 100 
samples and shown with standard deviations. The dotted line e = 0.75 shows the failure rate for a random choice of one base 
among the four base values. 



time the opening fork spends on each base decreases. At a force of 16 pN 80% of the predicted bases are correct. As 
the force increases e approaches 0.75, which corresponds to a random guess among four possible bases. 

The quality of prediction is, not surprisingly, greatly improved by the repetition of the numerical unzipping on the 
same molecule. Let R denote the number of time-traces (of the same duration) available. We show in Fig. [7| how the 
error e decreases with R. Notice that the error is calculated over the bp that have been opened at least once in all R 
unzippings. When opening and closing several times the molecule, the opening fork makes multiple passages through 
the same portion of the sequence; in this way more information on the waiting and transition times on each base are 
collected, and processed altogether by the Viterbi algorithm. Figure [7] indicates that the error decreases exponentially 
with R, an observation that will find theoretical support in Section [IVI 



3. Results for local estimators 



Figure [SJ^ (dashed curve) show the errors ci for the first 450 bases of the A-phage a,t f — 16.4 pN. Comparison 
with the free energy landscape G{n, /; A) ([1]) at the same force shows that the best predicted bases correspond to 
valleys (Fig [5] top), in which the fork spends a lot of time, while prediction for bp located on the top of barriers are 
much poorer. In addition Fig. [Sj'V shows that the errors sharply decrease when the prediction is made from i? = 40 
unzippings. 

We have investigated in detail the decay of the error with R for two arbitrarily selected bases z = 6 and 
i = 27. Figure [IKtop) shows that bp 6 is located in a valley of the free energy landscape at force / = 16.4 pN 
while base pair 27 is located on a barrier at the same force. Figure [TO] shows that the error decays exponentially 




FIG. 7: A. Error e as a function of the number of unzippings for the phage. B. Same as A but without distinguishing A from 

T and G from C, see text. 




FIG. 8: Probability Ci (A) that bp i is not correctly predicted and Shannon entropy ai (B) for the first 450 bp of the DNA 
A-phagc. Inference is made from R = 1 unzipping (dashed line) and R = 40 unzippings (full line). The force is / = 16.4 pN, 
and data are averaged over 1000 MC samples. 
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FIG. 9: Top: free energy landscape for unzipping at force / = 16.4 pN. Local minima correspond to the portion of the sequence 
that are best predicted. Bottom: pairing free energy as a function of the base pair index, without and with window-average 
(Gaussian weight over 20 base pairs). 
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FIG. 10: Error rate (semilog scale) as a function of the number of repeated unzippings for base pairs i — 6 (left) and i — 27 
(right) arbitrarily selected, for forces / — VTA and 40 pN. Numerical data are averaged over 25000 to lO'^ samples, see error 
bars. 
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FIG. 11: Top: error for the first 50 bases of the A-DNA for R — 1,50,200 unzippings. Bottom: connected correlation Xj,i 
for bases j — G and j — 27 (black dots) for R = 50 unzippings. Xar.i is multiphed by 10 to be more visible; data correspond to 
/ = 40 pN (large force). 



with R, Ei ~ exp{—R/Rc{f,i)). The value of the decay constant Rc{f,i) strongly depends on the force and the bp 
index. At large force, / = 40 pN, bp 27 is more easily predicted than bp 6. Fitting of the numerical data yields 
40,1 = 6) = 113±2 and Rdf = 40, i = 27) = 25± 1. Correspondingly about 400 and 75 unzippings, collected 
and analyzed together, are needed to make the error smaller than 1%. At moderate force, / = 17.4 pN, predictions 
for bp 6 require less unzippings than for bp 27. We obtain Rc{f = 17.4, i = 6) = 2.2 ± 0.1, meaning that about 6 
unzippings are sufficient to reduce the failure rate to 1%, while Rc{f — 17.4, i = 27) = 13± 1 and about 40 unzippings 
are needed to reduce the error to the same amount. 

The quality of predictions exhibit strong correlations from base to base. We show in Fig [TlTtop) the error for 
the first 50 bases of A-DNA at high force / > 40 pN. We observe that groups of neighboring bases are locked-in in 
that their errors decay at the same rate when increasing the number R of unzippings. Sec for instance in Fig [TT] the 
blocks containing base 6, extending from bases 1 to 9, and base 27, including bases 26 and 27 only. All the bases i 
in a block have the same decay constant Rc{f,i). The lock-in phenomenon is visible from the connected correlation 
function Xj,i ifTS]) - shown for bases j = 6 and j = 27 in Fig fTlT bottom). Xij is essentially a step- wise function, with 
highest valuea for the bases i in the same block as j, and smaller values for neighboring blocks. The values of the 
decay constants at finite force as well as the blocks of locked-in bases will be found back analytically by the theory. 

4-. Entropy of predictions on a base 

The error e is defined from the exact knowledge of the true sequence. In practice one would like to be able to 
assess the quality of prediction b* over base i without referring to the unknown true sequence. To do so we calculate 
the four optimal sequences for each of the four possible choices of bi — A, T, G, C using the above Viterbi algorithm, 
starting from base i and going backward until the first base 6i is reached and optimized over; we call Pi{b\\bi) the 
probability ([8]) corresponding to this left part of the sequence. Then we repeat the process starting from base i 
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and going forward until the last base of the molecule is reached and optimized over, and we obtain the probability 
P/v(^Ar|&i) corresponding to the right part of the sequence. Hence we obtain the most likely sequence constrained to 
have base i equal to bi, together with its weight W{bi) = P(6o|6i) x P{b*j^\bi). After a proper normalization we define 
the probability 

Wib^ 

W{A) + WiC) + W{T) + W{G) ^ ' 

for each of the four base values at location i. The base with the highest value of ji is the one predicted by the usual 
Viterbi procedure. The Shannon entropy, once averaged over MC data, 

(T, = -(^ log4M(^0> (16) 

bi 

is small when one of the four possible bases has much higher probability than the other ones, and high (close to 1) 
when bases are equiprobable. Figure [HJ3 shows that the behavior of tTj follows the one of along the sequence (fig 
IHK)- In other words, if a base has a much higher probability ^ than the other three bases it is very likely to be the 
correct one. The Shannon entropy is a good estimator of the quality of the prediction. 



B. Average Bayesian prediction 

Instead of the maximum likelihood probability ii{bi) we can compute the probability iJ,f{b) that base i is of type 
b = A,T,C,G through the expression ([7]), 

B'\b'.=b 

where we have summed over all sequences constrained to have the value b for base i. This corresponds to an 
average Bayesian prediction in contrast with the maximum probability prescription of Section IIII Al Wc construct 
our predicted sequence B"^, assigning to each base i the argument b which maximizes probability fif. 

As in Section IIII Al we have studied the quality of the prediction for different values of the applied force and of 
the number of unzippings. The fraction of mispredicted bases in as a function of the force and of the number 
of unzippings shows a similar behaviour (not shown) to its maximum probability case counterpart (Fig. [S]and[7]); a 
theoretical discussion of this equivalence in the case of homogeneous sequences will be given in Section llV A 21 In order 
to better understand this similarity for the A-phage we have considered three ten bp long portions of its sequence, 

-^1^°^ = (fcj, fei+i, 6i+5, 6i+8, fcj+g), located at i = 200, i = 140, and i = 90. The choice of 

the locations corresponds to low (cr ~ 0), medium (cr ~ 0.5) and high (cr ~ 1) entropy regions (Fig[8j3). We obtain 
complete sequences of length N by setting the bases outside the 10 bp window to the values they have in B* . For each 
of the three locations we have calculated the probability ([7]) of the 4^'^ ~ 10^ sequences B with the recursive formula 
dS]), divided by the largest probability i.e. the one of the sequence B* . These ratios r{B) < 1 are called relative 
probabilities. Even in a high entropy region most of the sequences have a very small relative probability r{B) 1, 
meaning that the average sequence B"^ is actually very close to the most likely one, B* . It is interesting to notice 
that smaller and smaller relative probabilities r do not necessarily correspond to higher and higher 'mutations' from 
B* . The average Hamming distance (number of bases bi not equal to their values b* in B*) of sequences with relative 
probabilities in [r; r + dr] is not a monotonic function of r. Less and less likely sequences are not obtained from the 
ground sequence through the mutation of a larger and larger number of bases. Due to stacking interactions, in fact, 
bases are not independent and it can be energetically favorable to flip a group of bases instead of a single one. 



IV. ANALYTICAL STUDY OF INFERENCE PERFORMANCES 



In this section, we present the theoretical studies carried out to better understand how the quality of the prediction 
depends on parameters e.g. force, sequence content, number of repetitions of the unzipping on the same molecule, 
... We start with the high force case where closing basically never occurs. The analytical study of this situation is 
performed first in the absence of stacking interactions between bases, then in the presence of stacking interactions. We 
show that the overall quality of the prediction crucially depends on the number of repetitions of the unzipping. Later 
on we turn to the case of finite force where closing and opening both take place, and show how the finite force study 
can be exactly reduced to the high force one with a stochastic number of unzippings whose distribution is calculated. 
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Throughout Section IIV Al and Section IIVB II only two types of bases, cahed weak (W) and strong (S) have been 
considered instead of the four types A, T, G, C . The real case of four type of bases is taken back into account from 
Section IIV CI Considering two instead of four base types allows us to make calculation shorter; we however stress 
that there is, in principle, no obstacle to the extension of our calculation to the four bases case. It is also justified a 
posteriori by our finding. The error in predicting the true value of a base 6, say, h — A/ia the sum of the probabilities 
of predicting the other three bases, here b — G, b — T, and b ^ G. We show that, when a large amount of data is 
collected, one of these three probabilities, say, 6 = G, is much larger than the other two probabilities, turning the four 
base type problem into an effective two base types problem. 



A. High force theory: no stacking interactions 



A quick calculation shows that, for forces equal or larger than 40 pN, the fork separating open and closed regions 
never goes backward in the course of unzipping. Indeed, gs{f = 40 pN) ~ —8.6, and thus even for strong bases with 
pairing free energy go ~ —3.6, the ratio of closing over opening rates equals exp(gs(/) — go) ~ e~^, and is less than 
one percent. Bases essentially never close, and the matrix M (bi, Ui, U, di) ^ simplifies since di = 0, and Ui = 1 
for all open base pairs. We hereafter calculate the quality of prediction in this case. 

Let us simplify further the problem and assume that base pair interactions are essentially due to the presence of 
hydrogen bonds, and not to stacking effects. In other words, we replace go{bi, bi+i) with go{bi) where bi can take two 
values: W (weak) or S (strong). The free energies are go{S) < go{W) < 0, and A = go{W) — go{S) > denotes their 
difference. 

Consider an unzipping experiment (one run of our Montecarlo program) which opens N base pairs: di — for 
all i, Ui = 1 for i < N and = for i > N. The times U spent on the bases i — 1, . . . ,N are uncorrelated and 
exponentially distributed: 

P{U\bf) ^re^°^''^'> exp (-r ef"^^^) i,) (18) 
The distributions corresponding to W and S bases are plotted in Fig. [121 We define the mean sojourn time on base i, 

iU) = - exp(-<?o(6f )) . (19) 
r 

and the normalized time 

<»' 

Obviously neither {ti) nor are accessible from the measure which gives access to ti only. From ITSl) . the distribution 
of the normalized time is exponential with average value unity, 

Pi(t,) = exp(-r,) . (21) 



1. Maximum a posteriori prediction 



Given a random value for Ti drawn from distribution (|2ip . the most likely value for the base, 6*, is obtained from 
Bayes formula ([7]) by maximizing 

P{bi\Ti) cx re^"'-''*-' exp 

An immediate calculation leads to the conclusion that a weak base (respectively a strong base) will be correctly 
predicted if r,; < (resp. r,; > t^) where 

r^ = -^ and = ^ (23) 
1 — e ^ — 1 

Therefore, the probability that a base is wrongly predicted depends on whether the base is weak or strong, and reads 

A 



= ^3'^rPi(r)=exp(^-- 



'1^1 rfTPi(r) = l-exp(^--^ 



(24) 
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Plots of eY ^-nd ef as functions of the free energy difference A shows that the latter probability is smaller than the 
former. At high force, maximum likelihood prediction works better on weak bases than on strong bases. The two 
limiting cases are: 



A 



0: we find £1 = -7 = 0.368, while ej" = 1 — ^ = 0.632. This result is, at first sight, surprising since 



both bases should become equivalent when the free energy difference tends to zero. It is a consequence of the 
maximal likelihood principle: the reduced time t has a higher probability to be smaller than its average value 



w 



1 when A — > 0), and therefore weak bases are predicted with higher probabilities than strong bases 



independently of the true base 6,f . We shall see in Section flV A 21 that this artifact disappears when prediction 
are carried out from the average Bayesian framework of Section [ill Bl 

• A ^ 00: when the difference in free energies between both bases gets very large, both are asymptotically 
perfectly predicted. The convergence to 100% correct prediction is faster for weak than for strong bases: 



A( 



The above analysis can straightforwardly be extended to the case of predictions made from repeated experiments. 



Let us call R the number of unzippings, and t^^\t^'^\ 



pop , we have to maximize 

PR{k\{ 



, Tj' ' the (normalized) times spent on base i. Using formula 



oc 



exp 



-r e 



aoiK 



(1) 



cx exp \Rgo{bi) - r e^"^^'^'^"^''^'^ n 



where 



(25) 



(26) 



is the total time spent on base i. The maximization over hi is very similar to the one carried out from eqn 
We find that formula p4)) for the probabilities of correct prediction holds for R unzippings provided the single time 
distribution Pi is replaced with the distribution Pr of the total time (see Appendix IB ip , 



PRin) 



exp(-ri) 



(27) 



and the times r 



(i?-l)! 

n'm g^j-Q multiplied by R. The distribution of (not normalized) sojourn times after R unzippings 



are shown in Fig. [12] for W and S sequences. An important remark is that the distributions become more and more 
concentrated as R grows; in other words the times become less and less stochastic and are faithful signatures of the 
thermodynamic nature of the attached base. The probabilities that weak and strong bases are not correctly predicted 
after R unzippings are given by 



w 



dT Pr{t) ^ J [ R, 



Rt^v 
Rr^ 



RA 



1 



dr PRir) = 1 - [ R, 



RA 



where 



7(0,2;) 



dt 



(a-1)! 



(28) 



(29) 



is the normalized incomplete Gamma function. 

To better understand how the quality of predictions improves with the number of unzippings, we have analytically 
calculated the asymptotic expansion of e in Appendix [Ej From expression (j28p . we have when i? ^ 1, 



-fl(T-l-lnr) 



<^R 



(30) 



V2TrR (r- 1) 

with T = or (j23p depending on the type of base. As a consequence, achieving good recognition requires a 
number of unzippings (much) larger than 
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FIG. 12: Probability distribution Pr of the sojourn time t spent on a weak {go{W) = —1.06, {t)w ~ 0.8/is, dashed line) and 
strong ((?o(5) = —3.9, {t}s = 13.7/is, full line) bases. Time is rescaled by 1/R (see horizontal axis). The number of unzippings 
is R = 1 (left), R — 2 (middle), and ii = 10 (right). The probability e (|12|) that a W (resp. S) base is not correctly predicted 
is the area under the dashed (resp. full) curve right (resp. left) to the crossing point. As R increases time distributions are 
more and more concentrated, and the error gets smaller and smaller. 



This crossover number depends on the free energy difference A, but not on the type of base: Rc{t^) = Rc{t^)- Fig 
[T3l shows that Rc is all the more large than A is small. Definitions ((3T|) for Rc and (|23p for yield 

Rc^-^ , A^O. (32) 

This expression is a good quantitative approximation for i?c up to A ~ 3. We have checked the validity of these 
theoretical results through numerical experiments using the Viterbi procedure of Section lIVBi where the free energy 
matrix go was modified to avoid stacking interaction. Figure [13] shows the perfect agreement between numerical and 
theoretical results. 

That the effort (number of unzippings) necessary to ensure an excellent prediction essentially depends on the 
difference of pairing free energies between the two types of bases one wishes to distinguish justifies a posteriori the 
simplification of taking into account only two types of bases. The cases of interest are: 

• Weak bases represent A or T, and strong bases G or C: the free energy difference is estimated to be A ~ 2.8 
(obtained from gQ{T,A) = —1.06,go{G,T) = —3.9). The probability of wrong prediction for strong bases, e^, 
is plotted in Fig 1131 as a function of the number R of unzippings. i? = 5 unzippings are enough to achieve 
excellent base recognition. 

• Weak bases are A, strong bases are T: the free energy difference is A ~ 0.5 (obtained from gQ{T,A) — 
— 1.06,gQ{A,T) ~ —1.55). Figure [T51 shows it takes about 100 unzippings to reach 99% confidence in the 
prediction. Thus, the number of unzippings considerably increases if we want to precisely resolve all base pairs. 

Sequence prediction can be then done in a hierarchical manner. A small number of unzippings i? ~ 5 is sufficient to 
distinguish between A,T and G,C bases, in agreement with numerical simulation data shown in Fig[7K&B, while more 
unzippings R ~ 100 are necessary to clearly separate A from T, and G from C bases. In this regard, our prediction 
procedure always amounts to distinguish between two types of bases. 



2. Average Bayesian prediction 

Average Bayesian prediction consists in estimating the the probability of the correct base P{bf\ti) (thermal average) 
and averaging over ti (quenched average) rather than looking for the most likely base bi given the time ti spent on 
base i (IIII B|) . This procedure gives, in the general case of R unzippings, 

'^"io ^^l + exp(-i?A + r(eA-l)) ' ^"^^^ 

We stress that the above expression gives the value of e"^ for both W and S bases. The quality of prediction does not 
depend on base bf, in contradistinction with the maximal likelihood case, see eqn (|28l) . This independence is a direct 
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number R of unzippings 



FIG. 13: Errors on sequences of, respectively, strong (full line) and weak (dashed line) bases as a function of the number R of 
unzippings in the infinite force limit and without stacking interaction. The difference of pairing free-energies A is, from bottom 
to top, 0.5, 1, and 2.8. We show the results of numerical simulations for , with the error bars for A = 0.5, 2.8 (full dots: 
S sequence, empty dots: W sequence). 



consequence of Bayes inference formula. By definition indeed 

P{t\W)+P{t\S) 



Jo Jo 



This expression is left unchanged when we exchange S and W. Therefore 

e^'^ = e^^^ (35) 

Notice that this proof is quite general: it not only holds for any number R of unzippings, but also for any microscopic 
model yielding an explicit expression for P{T\b^). In particular, it remains true at finite force. As the number R of 
unzippings increases, the prediction approaches perfection, see Appendix lEl 



sin(7rCT) V2^(l-r) 

with 



(36) 



--^^ ^^^d a=i-^. (37) 

This asymptotic scaling is, to the exponential order, identical to the one obtained in the maximum likelihood case 
pop. Therefore average and maximum likelihood predictions are asymptotically equivalent. 



3. Relationship with Shannon entropy 

The above findings explains the similarity between the error p2p and the Shannon entropy (|16p observed in 
Fig. [SK&B. Let us call e and 1 — e the probabilities that the prediction on a base is correct and erroneous respectively. 
The Shannon entropy reads 

0- = -e Ine - (1 - e) ln(l - e) ~ -e Ine ~ est x e^^^'^" (38) 

when the number of unzippings is large with respect to Re- This explains why the error and the Shannon entropy on 
a base roughly behave in the same way, and essentially vanish when the number of unzippings is far above its critical 
value Rc- This result is left unchanged in the case of four, and not two base types. 
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B. High force theory: stacking interactions 

Let us now study how the presence of stacking interactions modify the above findings. With two kinds of bases, 
the pairing free energy matrix is a 2 x 2 matrix gQ{b,b'). Strong bases (S) are chosen to be 'average' bases from a 
repeated GCGCGC... sequence while weak bases {W) represent a repeated ATATAT... sequence. The values of the 
interactions are the average values of the pairing free energy in each of the four quadrants of the original 4x4 matrix: 
goiW, W) = -1.42, goiS, W) = goiW, S) = -2.39, and 50(5", S) = -3.50. We define the free energy differences 

A"^ = \go{W,W)-go{W,S)\ , A"" ^ \go{W, S) - go{S, S)\ . (39) 

whose values are A^ — 0.98, A'^ = 1.11. The calculation of the probability of correct base prediction is more difficult 
than in the absence of stacking but can be carried out using techniques issued from the statistical mechanics of one 
dimensional disordered systems [sl, S^l . 

We start from the recursive eqn ([5]) for the probability Pi{bi) that the base of the sequence is equal to hi. As in 
the no-stacking case, we introduce the normalized time through eqn (|20p where the average sojourn time on base 
i now reads 

(i.)- Jcxp(-5o(6f,6f+i)) (40) 

Defining TTi{bi) = —[lnPi{bi)]/R and introducing the local fields, 

^ Tr,{S) - 7:,iW) (41) 

we rewrite eqns (|8|9|) under the form 

h,+i=F,{h,,n) (42) 
where function Fi depends on base bf through the average sojourn time (|40|) . 

F,{h,T) = max[/i + go(M^,VK)-(7o(5,W^)-r-^(ef"(W',H/)_e9o(s,M/)^^^Qj 

+ min[-h,goiW,S)^goiS,S)-r^{ea«('^'''^ -eS°^'''''^)T] (43) 

As Ti is a stochastic variable with distribution (|27p (for R repetitions of the experiment), hi is itself a stochastic 
variable. Its probability distribution, Qi, obeys the recursion 

/•oo poo 

Q^+l{K+^)^ dnPnin) dh,Q,{h,) d{h,+, - F,{h„n)) . (44) 

Jo J -oo 



1. Repeated sequences 

The stationary solution Q = Qi oi eqn is calculated in Appendix [C] for the three repeated sequences = 
WWWW..., SSSS..., and SWSW... referred to as WW, SS, and SW sequences respectively. These sequences differ 
from each other through their sojourn times (t) (HO)) . When the condition A'^ < A"^ is fulfilled as is the case for the 
example considered above, the stationary field distribution is better written in terms of its cumulative function 



Q{h) = I dh'Q{h') , (45) 

with the result 



h 



A{h) if h< -A^ 

= { Tst^^'li^;' if < ^ < (46) 

if h> A'^ 



where 

• "'"'-'V "-{xfeA" -1)'"JJ 'V"'5(l 
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and 7 is the incomplete Gamma function (|29p . The parameter x is defined as the ratio of the average sojourn time 
(t) over its value for the SW sequence, 



SW 



(48) 



Knowledge of the field distribution allows us to calculate the average fraction e of mispredicted bases (|14[) and the 
nearest-neighbor (d — 1) disconnected correlation function 



Xi 



XI + (1 - e)' 



(49) 



where the connected correlation function is defined in eqn (I13|) . The calculations are reported in Appendix [P] Results 
are 



• WW sequence: we have x = e 



and 



WW 



1 



AS 



dhQ{~h)Q{h) , ixT) 



dis\WW 
R 



dT 



Pr{t) Q (- 



A 



w 



tR{1 



-A'- 



(50) 



• SS sequence: we have x = e^ , and 



ss 

^R 



AS 



dhQ{-h)Q{h), ixin 



dis\SS 
R 



dTPliiT) 







R 



(51) 



• SW sequence: we have x — 1; the probabilities that bases S and W are not correctly predicted are, respectively, 



sw.s 

<^R 



while the correlation function reads 



AS 



dhQ{-h)Q{h) , 



sw.w 

^R 



1 



sw,w 

^R 



(52) 



dis\SW 
R 



drPRir) 



1-e- 



R' 



l-e 



2 1 

^ 2 



-A 



w 



R' 



(53) 



The subscript 'R' reminds us that the above expressions hold for data collected from R unzippings of the experiment. 
Let us stress that the field distributions Q (and their cumulative functions Q) appearing in the expressions of e and 
xf'" above depend on the sequence through the ratio x, see eqns (|46l47l48p . 

The above theoretical predictions are shown in Fig.ll4land Fig. [15] for the three sequences, and perfectly agree with 
numerical experiments. For SS and WW sequences, we find that the quality of predictions tends to 100% accuracy 
as the number R of unzippings increases. It is shown in Appendix [E] that the asymptotic scaling of en is given by 



where r equals 



.WW 



A 



w 



V47ri? (r- 1) 



and 



A' 



1 — e ^ — 1 

for WW and SS sequences respectively. The above formula shows that the number of unzippings must exceed 

Rr 



2(r- 1 -Inr) 



(54) 



(55) 



(56) 



in order to achieve good recognition; we find Rc ~ 4.3 and R^ ~ 3.3 for WW and SS sequences respectively. The 
nearest-neighbor correlation function xi in Fig- HH is very small, even for R — 1 unzipping. The quasi-independence 
of predictions can be understood from the analytical calculation of Appendix [Dl and is essentially due to the fact that 
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b = A 
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TABLE II: Single base mutation decay constant R"^"^{xby), that is, value of the number of unzippings necessary for a good 
prediction at high force of a base & as a function of the contiguous bases x (row) and y (column) . See equation (|60|) for a precise 
definition. Left: the central base \s b = A; the most dangerous mutation is b = A ^ b' = T for all contiguous bases, except for 
xy = AA where b' — G . Right: the central base is fe = C; the most dangerous mutation is b = C ^ b' — G for all contiguous 
bases, except for xy = GG where b' — A. 
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TABLE III: Decay constant Rc{xb xb'), that is, number of unzippings necessary for a good prediction, at high force, of a 
bond between base x (fixed as in the sequence, value indicated in the leftmost column) and base 6 (value reported in the top 
line), potentially predicted to be of b' type. The most dangerous (requiring the largest number of unzippings) mutation b b' 
are given by b' equal to the complementary base of b, except for the cases TT TG, GG GA, GG GT. 



the sums of the diagonal and off-diagonal elements of the go matrix are equal. We have numerically checked that the 
correlation function is very small at all distances d, not only at high forces, but for all forces above criticality. 

The above findings can be easily understood from the findings of Section IIV A II Consider for instance the WW 
sequence. When R gets very large, very few bases 5* are (wrongly) predicted to be in the sequence. Call e the 
probability that a single base S is predicted. The predicted event VFS'VF violates two stacking interactions (bonds) 
with respect to the correct event WWW. Let us make the simplifying hypothesis that these two violations are 
independent: e = /i^, where the probability /i of one bond violation depends on the free energy excess A*^ (|39)) of 
the erroneous bond WS (or SW) with respect to the true bond WW. We estimate the value of /i from the theory of 
Section llV A II n = eR ((50)) with r = ^-WW^^ gge (|23I55I) . This simple argument explains why the quality of predictions 
is much closer to 100% success in presence than in absence of stacking (for the same number of unzippings). In 
particular, the cross-over number of unzippings Rc required to achieve good recognition is twice smaller in the former 
case (ISBl) than in the latter case ([5^ . 

The behavior of the error e for the alternate SW sequence is slightly more subtle to interpret, see Fig. [121 From 
expressions (j52l53p . we find (see Appendix [E|, in the infinite R limit, 

^sw^s ^ ^SWM - = i and (xO^ - (xi)S^ = \ ■ (57) 

The limit value of e is at, first sight, disappointing. There is 50% probability that a S* or is predicted at a given 
position i along the sequence, showing that our prediction is not better than a purely random guess! However, the 
nearest- neighbor correlation function x is much higher than the value (1 — e)^ it would have if there were no correlation. 
Indeed, we find that the probability that base i -I- 1 is correctly predicted provided its neighbor at position i is equals 

as the number of unzippings increases. In other words, only two sequences can be predicted, either the correct one 
SWSWSW... or its mirror sequence WSWSWS.... Actually, both sequences produce identical unzipping signals since 
the pairing matrix go is symmetric, which is not the case for the true matrix (Table 
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number R of unzippings 



FIG. 14: Probability of misprediction for repeated WW (full line) and SS (dashed line) sequences as a function of the number R 
of unzippings in the infinite force limit and in presence of stacking interactions. Here, go{W, W) = —1.5, go{S, W) — go{W, S) = 
—2.5, go{S, S) = —3.5. The strong and weak sequences are repeated SS and WW sequences respectively. Simulation results are 
shown with the error bars. Remark that the slope of In e is about twice the one for the non-stacking case with A = 1 (Fig. I13|l , 
see eqn (|56p and attached discu 




FIG. 15: Probabilities t^''^ and t^'^ of mispredicting, respectively, a S (black dots, full curve) and W (empty dots, dashed 
curve) base in a repeated SW sequence as a function of the number R of unzippings in the infinite force limit. The stacking 
interactions are go{W,W) = —1.5,go{S,W) — go{W,S) = —2.5,go{S,S) — —3.5. Simulation results are shown with the error 
bars, while continuous curves correspond to the theoretical expression (|52|l . As R grows the prediction on a single base becomes 
essentially random (e i) since SWSW... and WSWS... sequences cannot be distinguished from one another. 



C. High force theory: decay constants Rc for heterogeneous sequences 

Let us turn to the realistic case of a non-repeated sequence with four base types, and stacking interactions between 
neighbouring bases. From the numerical findings of Section fill Al and the theoretical analysis of repeated sequences 
of Section [IV Bl we expect the error on a base to decay exponentially with the number R of unzippings. In a first step 
we estimate the decay constant within a single mutation assumption: all bases are assumed to be correctly predicted 
but the one under study [13] ■ However this single mutation assumption is not always correct. We will show that the 
decay of the error in predicting one base is often due to the difficulty in predicting a whole block of co-mutated bases, 
and give the corresponding expression of the decay constant Rc- 
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FIG. 16: Connected correlation function xi at distances d = 1 for, respectively, repeated SS (left panel) and WW (right panel) 
sequences as a function of the number R of unzippings in the infinite force limit (/ — 40 pN in simulations). For comparison 
we show the d — correlation function, xo = ^(1 ^ e)- 



1. Decay constant in the single base mutation assumption 



Consider a triplet of contiguous bases along the sequence, xby and let us start by calculating the error due to a 
predicted sequence with a single base mutation e.g. b ^ b' when keeping bases x and y to the correct values. In 
this case the argument following eqn ([56|) and obtained in the case of repeated sequences is still valid. As a result of 
stacking interactions the probability e^'^'^ of this mistake is the product of the probabilities e^''^^^ and e**^^^ ^ of 
either bond violation. The large R behavior of the error probability 

^..e-^/^^^"^-''^) (59) 

on base b is then obtained by selecting the worst value for the mutation b' , 

1 1 



1 

= mm 

R^J"'{xby) b'i^b) 



R„{xb ^ xb') R,{by ^ ¥y) 



(60) 



where Rc{xb — > xb') is the decay constant of the error obtained in the no-stacking theory of Section flV A II (applied 
here to a bond and not to a base violation); it is given by formula ((3T|) with A = 50(2;, b)—go{x, b') and r = A/(e'^ — 1). 
The values of Rc obtained from formula ((60|) are given in Table [TTl (after rounding to the closest integer) for base 
triplets xby with central base b = A and b = C respectively. The values of Rc for triplets with central bases b = T and 
b = G can be deduced from the decay constants of the complementary triplets, expressed in reversed order, due to 
the symmetry of the interaction matrix go of Table |T] e.5. i?J™(ATr) = i?^™(AAr). The value b' of the most difficult 
base to distinguish from b, see (IBH]) . is T when the central base is A and G when the central base is C, except in the 
AAA, CCC cases where b' = G, b' = A respectively. 



2. Propagation of errors, and blocks of locked-in bases 

The above single base mutation offers only a lower bound to the true value of the decay constant Rc{i) of the 
error in predicting base pair i. Strictly speaking, to calculate Rc{i), one must consider all the 3 x 4^~^ sequences 
where base i differ from its value in the true sequence, and find among those sequences the one which requires the 
largest number of unzippings to be discarded. In other words errors on bp i may result from the difficulty of correctly 
predicting a block of more than one bp located around bp i rather than this bp alone. 

We start by defining the decay constant for the large R behavior of the single bond misprediction probability 
^xy^x y jtqj, contiguous mutations {xy x'y'), 

^xy^x'y' ^ ^-R/R^(xy~*x'y') j-g-|^-j 
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where Rc{xy x'y') is given by eqn (|23|) with A = go{x, y) — go{x' , y') and r = A/ (e^ — 1) ((3T|) . We then define, in 
the maximum hkehhood framework, the probabihties (with respect to the random variables tj) ^~*{h) and lJ'*~(b) of 
predicting base pair i to be of 6-type when, respectively, the bases located to the right and the left of i are ignored. 
We assume that 

ti-*{b) = e-"'^^^'''^ and ^it (b) = e'^^^ '■^^ (62) 

for a large number R of unzippings, with boundary conditions Tr{*{b) = and 7rj^(6) = for all b. These probabilities 
can be evaluated from the probabilities of the most dangerous subsequence to the left and right of base pair i, according 
to the recurrence equations 



T^rib') = niin (^nl^j^ib) + 



1 



Rc{bf:bf_^,^b'b) 



remember 6f denotes the true type of bp i. These recurrence equations have a simple meaning. The probability that 
bp i is of b' type, when there is no base to the right of i, is simply given by the sum over b of the probability that bp 
i — 1 is of 6 type times the probability of predicting the bond bb' instead of Notice that recurrence eqns ((63)) 

are simply the asymptotic counterpart of eqn (|44p in the large R limit (for four and not two base types). They can 
be obtained from eqn © and (|D3p by choosing for ti the time having equal probabilities with the true bond bf^^bf 
and the erroneous bond b b' distributions [13] . 

The decay constant Rc{i) of the error on bp i is obtained by selecting the most dangerous value for the type 5, 

^ min (7rr(6)+7rr(fo)) • (64) 



Rc{i) b^bf 

In general Rc{i) differs from the single mutation value, R^™'{i). The latter depends only on the base and its two 
neighbors while the former depends on the whole sequence. Equations ()63p and (|64|) can be interpreted by considering 
'^Vib) + '^i*{b) as the free energy for the lowest excited state (sequence) with the base i fixed to a value, b, distinct 
from the one, bf, in the ground state (real sequence). If the base i has a very large value for R^'^{i), because both 
the bonds on the right and on the left of the base have a large Rc (see eqn l60| , the most dangereous sequence is 
exactly this 'single mutation' sequence. In this case the minimum over b in (|63p is exactly obtained for b — bf_i and 
b = bf_^j^, and the recursion halts after the nearest neighbors. However, when the bond constant i?^™ is small, we 
can expect that it is less costly, in terms of free energy, to propagate the excitation at site i in a configuration where 
the base and its neighboring base are both mutated into their complementary values. The decay constant Rc for 
such a bond is indeed large because it is difficult to distinguish two bases from the complementary ones (Table IIIII) . 
This 'defect' propagates, in the recurrence eqn ([M)) . until an interface with a large value for Rc is found. Obviously 
this propagation mechanism takes place on both sides of bp i. The most dangerous excitations are thus blocks of 
complementary bases of the real sequence. The bases in a block have then roughly the same Rc and are locked-in 
together fFigfTTjl. 

The high force behaviour of the errors (for R = 1,50,200), obtained by the numerical inference and shown in 
Fig [TT] agree with these theoretical results. The theoretical values for the decay constants Rc{f > 40piV, i) obtained 
from (|63l64p are shown in Fig. [20] (dotted line) . By solving eqn ([63]) we find that bp i = 6 belongs to a block extending 
from bp 1 to 9. The boundary bp 1 has Rc on the left equal to oo and bp 8 has Rc{GA — > CA) — 139. From eqn ([55]) 
we obtain Rc — 114 for the whole block 1-9. This value coincide with the decay of the error at large R found from 
simulations and shown in FiglTOl We obtain i?™"" ~ 113 ± 2 from a fit of log vs. i? at / = 40 pN. [43] Base pair 27 
belongs to a block on the right spreading over the whole sequence down to base 1, while the block on the left stops 
on the base itself. The number of unzippings needed for a good prediction of bp 27 is smaller: we obtain from theory 
Rc = 24, and from simulation i?""™ = 25 ± 1. Note that the propagation of the error by blocks of complementary 



bases in this section go beyond the single mutation approximation reported in [34 1 . 



D. Moderate force theory 

1. On the number of single-base openings 

We now investigate the case of unzipping under a finite force. The opening fork may go backward, closing a 
previously open base pair, and reach this base pair later. Therefore the number Ui of opening transitions i ^ Ui, 
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FIG. 17: Average number (ui) of openings of bp i for the A-phage sequence during one unzipping for forces 16.4 and 17.4 pN. 
A. theoretical values in the limit of infinite time. B. numerical values from MC simulations with M = 10^ steps. Note that 
the infinite time theoretical values coincide with the numerical values up to some base index imax such that . , . (ui) <C M 
e.g. i,r,ax ^ 200 for / = 16.4 pN and M = 10* steps. 



is not always equal to unity but is stochastic and varies from experiment to experiment, and base to base. To calculate 
the distribution of Ui it is convenient to think of the opening and closing process as an unidimensional random walk 
where, at each move, the probability to go backward and forward (closing and opening transitions respectively) are 
equal to qi and 1 — qi respectively, with 

" '^^JIT^r^^^Jb-blTT) ■ ^^^^ 

For forces larger than the critical force, we have qi < ^: the random walk is submitted to a forward drift and is 
transient. We define the probability of escape, Ei, as the probability of never reaching back position i starting from 
position i + The case of infinite force corresponds to = 1. For a homogeneous sequence the free energy landscape 
G{n, /) in which the random walk takes place is simply a tilted line; E = {1 — 2q)/{l — q) depends on the force and 
on the sequence type. For a heterogeneous sequence the free energy landscape G(n, /) is more complex (Fig. [9]), Ei 
depends not only on the force and on the base type hi (and on its neighbor fe^+i) but also on its environment e.g. 
whether base i is located in a local minimum or in a local maximum of the free-energy landscape. We show how to 
calculate E^ in Appendix |F] for any given sequence. 

The distribution pi of the number Ui of opening transitions z — > i + 1 during a single unzipping is simply obtained 
from Ei and reads 

pi{ui)^{l~E,Y'-^ E, (66) 
From equation l|66p we have that the average number of openings of bp i is 

= ^ ■ (67) 

(ui) is shown in Fig. [T7] for forces / = 16.4, 17.4 pN for the first 400 bases of the A phage DNA sequence. Theoretical 
values for (ui) are obtained in the limit of infinite time while MC simulations (or experiments) duration is finite. 
Call i'"''* the expectation value of the last-passage time of the fork at site i; 4°''* is finite since the random walk is 
transient. Clearly theoretical and MC values for (ui) will coincide for bases of indices i < imax where t\'^^^ is equal 
to the duration of the simulation. In practice we estimate imax through the condition X]i<i i'^i) — where M is 
the number of MC moves. The outcome for imax is plotted in the inset of Fig [51 For instance, as shown in Fig. [T71 
imax — 200 for / = 16.4 pN and M = 10®. (ui) varies a lot from base to base, and reaches values up to 10* (for the 
considered force). 

The generalization of the calculation of the distribution pR,{ui) of the number of openings of base pair i to the case 
of R unzippings is immediate (Appendix IB 2p . The result is the R*^^ convolution power of pi, and reads 

PH{u.)=h"_^\{l-E,r-^E^ . (68) 
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2. Error in predicting a base in the absence of stacking 



The number of opening transitions of a base at finite force, u^, plays the same role as the number R of repetitions of 
the unzippings at large force. As the fork visits again and again the same base pair more and more data are collected 
on the sojourn time ti on this base and the prediction error becomes smaller and smaller. However, contrary to R, 
Ui is a stochastic variable. The error in predicting base pair i of type bi = W, S, in the absence of stacking is then 
obtained by averaging the error on this bond at large force and after Ui unzippings, e^' (j28p . over the distribution pji 



Ui>l 



(69) 



where the / subscript indicate that the above formula holds for a finite force. A detailed derivation of eqn ([55)1 is 



1 from (|65p. Pfl(ui) Sui,R from 



and e 



f.R 



given in Appendix IG II In the limit of large force Ei 
expected. 

Error Ij69p can be easily computed when the error e^'. is replaced with asymptotic expression (|30p . Using the 
expression for the generating function of the probability pn with argument exp(— given in Appendix IB 21 we 
obtain 



In 



/Ra 



e^;^ ~ e"«/«=(/^*) with i?c (/,*) = 
The above decay constants Rc can be approximated with the simpler expression 



1 



Rc 



(70) 



(71) 



which are quantitatively accurate unless the number of required unzipping at large force, Rc-, becomes much smaller 
than {ui) i.e. close to the critical force. This formula simply expresses that the effective number of unzippings to 
correctly predict base i at finite force is i? x (u^) rather than R. Recall that the value of the decay constant of the 
error at high force, Rc, depends only on the free energy difference between W and 5* bases. At finite force this decay 
constant is roughly divided by {ui). The latter depends on the whole free energy landscape around the base. Therefore 
at finite force, even in the absence of stacking interaction, the error on a base depends on the whole sequence of bases. 
Moreover bases with a large Rc that are in a valley of the free energy landscape can be better predicted than bases 
with a small Rc located on the top of barriers in the landscape. 

Let us apply the above result to the case of a homogeneous sequence, with two base types, b = W, S. The decay 
constant Rc PT|) at high force depends only on the free energy difference A between W and S bases. For a homogeneous 
sequence the average number of openings of each base is simply (u) ~ jE^^j where q is obtained from formula (j65p 
with go{bi, fe^+i) = go{b). In Fig[TH]we plot the error for W bases for A = 2.8 (to distinguish a sequence of bases A or T 
from a sequence of bases G or C) and A = 0.5 (to distinguish a sequence of A bases from one of T bases, or a sequence 
of C bases from one of G bases). The plot for a repeated sequence of S bases is similar. As shown in Fig [TH] the error 
sharply decreases when the force reaches its critical value from above e.g. fc = 9.25 pN for .go(^) = —1.1 k^T. As 
shown in Fig [TH] the decay constant ([70]) 



Rcif) = 



In 



(l-(?)e 



l/Ra 



q 



1 - 2q 



(72) 



obtained by approximating with a pure exponential is in perfect agreement with the numerical calculation of 
formula ej?'^. The simplified expression (|7T|) 



Rcif) = i?c X 



1 - 2q 

1-g 



(73) 



is in very good agreement with Rdf), except in the case A — 2.8, f = fc + 2 pN for which the decay constant is very 
small. 

The value of Rdf) is plotted as a function of the force in Fig [19] for various sequences, and allows us to draw 
the phase diagram for the prediction in the force vs. number of unzippings plane. The prediction becomes perfect, 
^/ i? ^ 1, if the number R of unzippings is (much) larger than some crossover value Rc (17^ . It appears that Rc{f) is 
always smaller than its infinite force value Rc, and vanishes when the force reaches the critical unzipping force from 
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FIG. 18: Probability e of misprediction on repeated sequences of W (empty dots, dashed lines) and S (black dots, full lines) 
bases for pairing free-energy differences A = 2.8 (A) and A = 0.5 (B) in the absence of stacking. For each case we show the 
error as a function of the number R of unzippings for forces above the critical force by 0.5, 2 and 10 pN. The decay constants 
have for A = 2.8 the following values: R^if = oo) = 32; for / = /c + 10 pN, Rc = 28.5; for / = + 2 pN, Rc = 10.9; for 
/ = /c + 0.5 pN, Rc = 3.4. 
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FIG. 19: Phase diagram in the number of unzippings vs. force plane. Efficient prediction is possible above the critical line 
Rcif) ((72)) . Here go{W) = —1.06, goiS) = —1.55. The full line indicates the repeated W sequence, the dashed line corresponds 
to the repeated 5* sequence. For forces smaller than the critical value /c ~ 9 pN for the W sequence, fc — 12 pN for the S 
sequence (vertical lines) the molecule remains closed. At large force the number of required unzippings reaches a common value 
Rc ~ 30. 



above, / — > . In this limit, q —> ^: the motion of the opening fork becomes purely diffusive, and each base is visited 
a very large number of times going to infinity for an infinite duration of the experiment. Predictions made from a 
single unzipping are reliable provided Rdf) < 1 i.e. the force / does not exceed by a large amount its critical value 

/c; 

where dc = \dgs/df{fc) \ is twice the extension of a DNA single strand monomer at the critical force, and we have used 
expression (j3ip for Rc. Typically, dc ^ 1 nm ~ 0.25 k^T/pN, leading to / — /c < ^A^ pN with A expressed in units 
of k^T. Notice that this theoretical result does not consider the actual number of open base pairs, which decreases 
as the force is lowered to its critical value, but only the quality of their prediction. 



3. Results for heterogeneous sequence in presence of stacking interactions 

The above theory tells us how many unzippings are necessary to recognize a base type from another at moderate 
force, when the pairing free energies of these two base types differ by A and when the fork opens the base (ui) times 
in each unzipping. It can be applied to the case of bond and not base recognition as we have done at large force in 
Section ITVCTl The number of unzippings Rc{f,i,bfb^_^_i — > bb') necessary to recognize that the bond between base 
pairs i and « + 1 is not bb' is given by expression ([70)1 or ([7T|) with Rc substitued with Rc{bfbf_^^ bb'), see Section 
IIV C 11 which depends on the biochemical parameters go{bf, bf_^_^) — go{b, b') given in Table HIl 
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Base pair i 

FIG. 20: Theoretical values for the number Rc{f,i) of unzippings necessary for a good prediction of base i at force / = 17.4 
(full line) and / > 40 pN (dashed line) for the first 400 bases of the A phage sequence obtained from formula (|70|) 



The decay constant of the error on base i at finite force, Rdf, i), is calculated by applying the recursive formula 
and the minimization formula (j64p after replacing the bond decay constants at infinite force with the ones at finite 
force, 



(75) 



with bondary condition 7rj^^(6) = 7r]^j(6) = 0. The minimization condition then reads 

1 



Rcif,i) b^b\; 



Kf{b)+Hfib)) ■ (76) 



Figure HOI shows the values of Rc{f,i) at / = 17.4 pN (full line) for the first 400 base pairs of the A-phage derived 
from (|70p . Rc{f,i) is in very good agreement with the decay constant of the error obtained through the numerical 
inference procedure and shown in Fig[51\-. Indeed, roughly, for all bases with Rc{f,i) < 15 the numerical inference 
errors goes to zero with i? = 40 unzippings. For a more precise comparison we have focused on two specific bases 
(Figlini). 

Base pair 6 is located in a valley of the landscape G at force of 17.4 pN, hence the number of openings of the base, 
(ui), and of its neighbors, (uj) with j close to i, are large e.g. (ui) = 28000, (wg) — 60 as shown in Fig[T71 The decay 
constant of the error quickly decreases with the force from Rc{f > 40 pN, i = 6) = 114 to i?c(/ = 17.4 pN, i = 6) = 2; 
these theoretical values are in very good agreement with the numerical findings of Fig 1101 Moreover the connected 
correlation function Xifi at / = 17.4 pN has non-zero value up to the base i = 20. Solving the recursive eqns (|75I76[) 
we found that the decay of the prediction error on i = 6 originates from a 20 defect-sequence where bases 1-20 are 
locked-in into their complementary values with respect to the true sequence. 

Base pair 27 lies, on the contrary, on a barrier of the free energy landscape and the numbers of openings (at a 
force of 17.4 pN) of this base (and its neighbors) is smaller: (^27) = 1.5 as shown in Fig. [iTl The decay constant 
decreases slightly when the force diminishes, from Rdf > 40 pN, i = 27) = 24 to Rdf = 17 pN, i — 27) — 15. These 
theoretical values agree very well with the fit of the numerical simulations in Fig [TOl Moreover the decay of the 
prediction error on base 27 at / = 17.4 pN came from a two-defect excitation of bases 26-27. Note that numerical 
results are limited by the finite number of samples from which the error e.^ is calculated. The number of samples Mp 
necessary to estimate accurately the error must be much larger than the inverse of the probability of misprediction. 
With Mp = 2 10^ (Fig [TU)) errors smaller than e = 10~^ cannot be measured. As e decreases exponentially with R, 
Mp must scale as exp(i? n) with fi > Rc to reach a good estimate of i?c- Finite sampling could also lead to statistical 
bias due to the large deviation fiuctuations of Ui. We show that these effects are negligible in Appendix |T1 



E. Inference from two-way unzippings 



We hereafter consider that the molecule can be unzipped from both extremities (two-way opening) and want to 
infer its sequence from the data collected in both directions. This investigation is motivated by the observation that 



25 



R"*-(f,i)''' 






100 200 300 400 500 

Base i 



FIG. 21; Decay constant R^^ {f^i) of the prediction error on base i for the first 450 base pairs of the A phage DNA, at force 
/ = 17.4 pN from the two-way unzipping numerical unzipping. 

the free energy landscape is flipped i.e. multiplied by —1 when the molecule is opened from the other extremity. 
Bases that were located in local maxima in the landscape, hence poorly predicted, become local minima in the new 
landscape, and are much better predicted. 

Let us denote + the normal direction of unzipping of the molecule: the i*^ base (along the 5' 3' strand of 
molecule) in this direction is simply hi. The free energy to open the first n bases of the molecule is G'^{n, /; B), equal 
to G defined in ([T]) . In the reverse direction, denoted by — , we denote by b~ the base along the 5' 3' direction: 
= complibN+i-i) where compl(6) denotes the complementary base of b. The free energy to open the first n > 
bases of the molecule in the — direction is 



n-l 



,9o[ 



-n 9s{f) 



N 

E 

i=N-n+l 



go{bN~^, bN-^+l)-n gsif) = -G+{N-n, /; B)+G{N, /; B) (77) 



where we have used the symmetry go{b,b') — (7o(compl(6'), compl(&)) of the go interaction matrix (Table |T| [4^ . 
Therefore, up to an irrelevant additive constant, the free energy to open n bp in the — direction is simply the opposite 
of the free energy to open TV — n bp in the + direction. 

If we unzip R times the molecule in the + direction the error in predicting base i will decay exponentially with 
R with a decay constant equal to R^{f, i) given by eqn ([75]) . We may instead open R times the molecule in the — 
direction, and infer the value of base i (labeled iV + 1 — i in the — nomenclature) . The probability of a mistake is 
again an exponentially decreasing function of R with decay constant R^{f,i) (j73p . calculated from the number of 
openings of base i in the — direction f Appendix IG 2[) . 

Assume now that the unzip R/2 times the molecule in the + direction and R/2 times in the — direction. We 
show in Appendix IG 21 that the probability of predicting that the bases attached to the bond i,i + 1 are b, b' decays 
exponentially with R with a decay constant equal to 
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(78) 



We have taken into account the effects of stacking interactions between nearest neighbor base pairs as done in 
Section llV CI The decay constant of the error in the two-way unzipping at force /, (/,«), is obtained using 

recurrence eqn ([76]) upon substitution of Rc{f,i,bfbf_^^ — > bb') with R^^~ {fjijbfbf^i bb'). The results for 
R^^~{f, i) are shown in Fig [51] A comparison with Fig [^D] shows that the number of unzippings necessary for a good 
prediction greatly decreases with the two-way unzipping procedure with respect to the one-way unzipping (for the 
same amount of collected data). 



V. TOWARDS MORE REALISTIC DATA MODELING 



A. Finite-bandwidth inference 



So far we have assumed that the temporal resolution was infinite. A time-trace contains a perfect information on 
the opening dynamics i.e. on the motion of the fork (set of numbers Ui ,di) and on the sojourn times ti for every 
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base i of the chain. Real experiments obviously do not have such a perfect sensitivity: actual feedback systems and 
detectors are limited to delays between measures of about At ^ 0.1 — 1 ms. This temporal resolution is a major 
limitation: during the delay At the fork can explore up to 100-1000 bases around the starting position, depending 
on the local structure of the free energy landscape. The true dynamics of the fork is therefore unknown and the 
prediction algorithm has to consider all the trajectories of the fork (in a ~100 bp window). This problem is studied 
in detail in [42]. Hereafter we limit ourselves to the case of a finite but very large bandwidth where the delay At 
between two measures is of the order of the opening time of a bp (and not much smaller as considered so far). 



Typical jump between two measures 



Rates (O define the non zero (off diagonal) elements of the elementary transitions matrix 

Hi' A = ro{i) ■ Si'^i+i + rdf) ■ S^j^i - {ro{i) + rdf)) ■ S^^i 
The evolution operator after a time At is given by the matrix exponential 

U = exp [At H] 



(79) 



(80) 



The entry Ui/^i represents the probability of going from base i to base i' in the time interval At. In principle all 
transitions are allowed and U is therefore a N x N matrix. In practice jumps j — i' — i are unlikely to exceed (in 
absolute value) the ratio At/r where r is the typical time to open a bp. The probability distribution of jumps j, 
averaged over the starting base i, is shown in Fig [22] for / = 16.4 and 17.4 pN, and At ranging between lO^^s and 
10~'^s. As the force and the sampling interval increases the distribution gradually spreads over larger jump values, and 
long tails appear. Nevertheless, long jumps seem to be rare events, restricted to particular regions of the landscape. 
Most of the information on the opening dynamics can therefore be kept when discarding displacements larger than 
some threshold J e.g. J = 10 in Fig[22| To do so, given the starting base i, we construct a reduced (2J+ 1) x (2J+ 1) 
matrix H'^-^'^^ as follows, 



/ -ro(i - J) - rc 
ro{i ~ J) 
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J+l)-r, 
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-ro{i - J + 2) - Tc 
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(81) 



ro{i + J-l) -ro{i + J)-rc/ 



and the associated evolution operator = exp [At //f'^^*)] , which encodes all the jumps from base i of amplitude 

less or equal to J. There are 

4(2j+i) different U'--^^ matrices, one for each possible choice of the 2 J + 1 bases involved. 



2. Extended Viterhi algorithm 

Given a sequence B for the molecule the probability of a time-trace T (where the number of open bp is measured 
at times multiple of At) is given by a product of 4'' x A'' transfer matrices 

V^'^ {T\B) = W M(J^'\h, . . . h+j) (82) 

i 

with 

M(^^^)(6.,...,6.+,) = f[0^^;i)'^' iUm'')'^" (83) 

and fcp'' is the number of transitions i — > i + j, with j — —J,—J+ 1, . . . , J — 1, J in T. Notice that kf^\ k^^'' and 
fcj- coincide with ti/ At, Ui and di respectively. 

An extended Viterbi algorithm allows us to find the most probable sequence. We now have to consider the probability 
of a sequence of J contiguous base, starting from i, and write a recursion equation for this probability, 

P!+l{h+i, . . . ,5,+j_i) = max [m(^'^)(6„ . . . ,6,+j) x Py\b,, . . . ,6.+j-i)] , (84) 
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jump jump 

FIG. 22: Probability distribution of a j-base jump for At between 10~^s and lO'^^s and for forces / — 16.4 and 17.4 pN. 
Notice that the probability is not necessarily a monotonously decreasing function of see extra humps in the right column, 
due to sequence effects. 



which extends eqn ([5]) to J > 2. For the first base i = 1 the optimization is simply 

P^'^ (62, ... , bj+i) = maxM(^'2) (5^^ fo^, . . . , (85) 
fci 

The optimal choice for bi depends on the J next base values, bl = &™''^(^2, . ■ . , bj+i). Then we find the next base, 
&2 as a function of 63, ... , bj+2 through ([84]), and so on, until the last base of the chain is reached. Its most probable 
value is selected and the whole optimal sequence is recursively reconstructed from the fo™'*^ functions. 



3. Numerical study 

We first generate a set of numerical data by recording the MC output (fork position) at discrete times multiple 

of a sampling interval At; intermediate states are simply ignored as the instrument does not have the resolution 

(7) 

to appreciate them. Then we preprocess this partial time-trace to obtain the transition number k[ , and make a 
prediction for the sequence using the above extended Viterbi algorithm. 

Figure shows the quality of prediction as a function of the delay At at fixed range J — 2, 3, 4, 6 and for a 
single unzipping {R = 1). Data shows that, for a given range J, there exists a threshold value for At above which the 
maximum displacement permitted becomes too small to properly describe the unzipping dynamics. The information 
collected is no longer sufficient for a reliable prediction and the error e rapidly increases (see Fig As expected 

the threshold At increases with the range, meaning that larger ranges are better suited to deal with longer sampling 
intervals. When At is small, comparable with the elementary sojourn time on a base (r ~ 1/is for a weak base), the 
performances are equivalent to the one of the J = 1 case. 

The relationship between the range J and the largest delay At it can sustain is better seen on the case of uniform 
sequences. The characteristic sojourn time on a base, (t) p9p . is then uniform throughout the sequence e.g. (t) ~ 1 /Lts 
for a repeated sequence of W bases. Fig [23l 3 shows that the prediction is perfect up to a temporal resolution 
At ~ J X (t), where (t) is the characteristic sojourn time on a base pair, and J is the range of the algorithm. The 
existence of a threshold for the delay is clearer at high R than for R = 1 (Fig ESK) due to the presence of larger 
fluctuations in the sojourn time in the latter case. 

Figure [24l (left) shows that the quality of the prediction betters when the information from several opening experi- 
ments is collected. As long as the typical jump associated to a delay At is smaller than the range J (Fig [511) the error 
e can be reduced and values of order 10~^ are reached after 50 unzippings for the A-phage sequence at force / = 16.4 
pN . Once the threshold At is crossed, however, the loss of information can not be 'repaired' and repetitions of the 
experiment appear to be useless. The fork has moved too far away during the delay At and a lot of information falls 
out the window of size J our algorithm is based on, an effect which cannot be compensated with multiple experiments. 
The effect is qualitatively similar for the weak/strong (AT/GC) distinction shown in Figure[Ml but is somewhat less 
dramatic from a quantitative point of view. 
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FIG. 23: Error e as a function of the delay At between measures for various ranges (shown on Figure). A. Case of one unzipping 
{R = 1) of a A-phage DNA molecule at / = 16.4 pN. B. Case of _R = 20 unzippings of a uniform sequence of weak bases at 
/ = 11.8 pN. Results are averaged over 50 samples in both panels. 



B. Fluctuations of the unzipped DNA strands 

Real experiments give access to the extension x of the open DNA (ssDNA) strands, and not to the number i of 
open bp (Fig. [T]). Due to the intrinsic elasticity of the strands x fluctuates even at fixed i, and these fluctuations grow 
with i. Indeed a strand is made of i monomers, each acting as a spring with stiffness constant K ~ 170 pN/nm at 
/ = 16 pN and room temperature (2^ . The distribution A(x|i) of the extension x for a given i is roughly Gaussian, 
with mean ixo where xq = dgss/df ~ .9 nm is twice the average extension of a ssDNA monomer, and standard 
deviation ^/iSx where Sx — y^2 ks T/K ~ .2 nm (Fig [25]). Distribution A could be precisely measured through a 
combination of optical trap and single- molecule fluorescence techniques [21]. 

1. Effect of ssDNA fluctuations on the Bayesian inference 

We hereafter study the effects of these fluctuations on the inference problem in the absence of stacking interactions 
and at high force. We start by making more precise the notion of the time spent on a base: 

• the real time t\: this is the time really spent by the fork on bp i, simply denoted by ti so far. This number 
is stochastic since the fork undergoes a random walk motion, with a distribution depending on the nature of 
base i (fT8|) . The absence of stacking ensures that real times attached to distinct bases are uncorrelated; the 
probability of the set of real times T'" = {i*} given a sequence B is, up to a sequence-independent multiplicative 
factor, 



i 



(86) 
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FIG. 24: Left: Fraction of mispredicted bases e as a function of the number of unzippings for different temporal resolutions At. 
The value of the range is J = 4. Right: same as right but we only discriminate among strong and weak bases. Data refer to 
the opening of a A-phage sequence at f=16.4 pN and they are averaged over 50 samples. 
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FIG. 25: Distribution A{x\i) of the extension x of the open ssDNA at fixed position of the opening fork, i — 1 and i — 10. The 
r.m.s. of the distribution (at a force of 16 pN) increases as The apparent value of the number of opened bases corresponding 
to a given x, (f88|l. is shown on the top axis. 



which corresponds to (|5l6p in the hmiting case of high force and no stacking. Given a set of real times the best 
sequence B*[T^') is the one maximizing V{T'^\B). The probabihty of predicting sequence B is, given the true 
sequence _B^, 

Q:{B) = I dT"- V(T''\B^) W e{V{T''\B)-V{T''\B')) (87) 

B'(^B) 

where 9 is the Heaviside function, 6{x) = 1 if x > 0, otherwise. In practice, however, one has no access to the 
real times. 

• the apparent time tf: Given a measure for the extension x of the ssDNA we define the apparent position of the 
fork through 

X 

i° = Closest integer to — . (88) 

Xo 

The value of is stochastic, with a probability A depending on the real position of the fork, i'^. Considering 
Rouse dynamics for the monomers [41| the longest relaxation time of a strand is, denoting the viscosity of the 
solvent by tr{n) ~ (/{Ktt'^) x (2n)^ ~ 100 ps. For molecules with < 100 bp ssDNA reaches equilibrium 
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faster than the fork moves. The probability to observe > 1 at some instant thus depends only on the true 
position of the fork at the same time, and reads, when i'' > 1, 



dv 



exp 



(89) 



with (T^ = Sx/xq] the expression for = is obtained from (j89p upon replacement of the lower integration limit 
with —oo. When the molecule is entirely closed {i^ ~ 0) all values of have zero probability except i"^ — 
{A{0\0) = 1); this choice amounts to neglect the fluctuations in the extension of the DNA linkers. 

We call tf the time apparently spent by the fork on bp i, that is, the number of measures in a time-trace in 
which the fork appears to be at location i according to ([55]) . divided by the delay At between two measures. 
Matrix A ((89)) implicitly define the probability distribution of a set of apparent times = {tf} given a set 
of real times, see Appendix |H] for more details. Multiplicating by and integrating over the real times 
formally defines the probability 7"'(r°|i3) of a set of apparent times given a sequence B. Given an apparent 
signal T" the best sequence B*{T°-) is the one maximizing V°'{T°'\B). The probability of predicting sequence 
B is, given the true sequence B^, 



Q^iB) ^ f dT" ViT^lB^) Yl 0{V''{T''\B)-V''{T''\B')) 



(90) 



Consider first the ideal case where the delay At between successive measures is vanishingly small. In this limit, 
given the set of real times, the apparent times tf are not stochastic but simply obtained through the convolution of 
the t"s with matrix A ([89]) : = ^4 • T'' in vectorial notation. Starting from the probability ([90]) of predicting a 
sequence from the apparent times and performing the change of variable T'' = A^^ ■ we obtain Q°'{B) = QJ'{B) 
((87)) . The probability, within Bayes framework, of predicting the true sequence B^ is the same as in the absence 
of fluctuations. In particular the values for Rc calculated in the previous Section are unaffected by the presence of 
ssDNA elasticity. 

This result does not hold for finite delays At where, given a set T*" of real times, the apparent times tf are stochastic 
due to the finite number of samplings during the sojourn time on each base. Let us assume that the delay At between 
successive measures is small with respect to the sojourn time (t) on a base pair but non zero. The Bayesian probability 
Q°(-B) of a sequence now depends on the fluctuation matrix A. For the sake of simplicity we consider only the case 
of a large number of unzippings, and a repeated sequence of bases S with a unique W base at location i. Let 



p ^^^g re 



(91) 



denote the ratio of the delay over the average time spent on a S" base; by hypothesis p <C 1. The probability that the 
W base is not correctly predicted reads f Appendix (H)) . 



^R,^ = Q^iB^) - e'^/^"^'^ where Rdi) 



A2 {ATp-^A)^ 



(3: 



(1-p) [AA^ 



)3,k 



pid 



(92) 



and A^ denotes the transposed matrix of A. The above formula holds for a small difference A of free energies between 
the weak and strong bases, see ((32)). The outcome for Rc{i) is shown in Fig[26K for p — 0.1 and grows as the square 
root of i 34 1 . More precisely we find Rc{i) oc a \/i where a = y/SxJxQ, and the proportionality factor depends on p. 



Perfect prediction is still possible, but at the price of a number of unzippings growing with the base index. 



2. Sequence prediction through deconvolution 

The above results do not tell us how to make a prediction for the sequence given an apparent signal T". The 
expression for is highly non local: the probability of the time tf does not depend on the type 6^ of base at location 
i but also on its neighbors. A practical procedure consists in calculating, once the apparent times T° are measured, 
the set of deconvoluted times T'^ = {tf} through the formula 



3 



(93) 
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FIG. 26: Value of the number of unzippings controlling the decay of the error in predicting a base, Rc{i), as a function of the 
bcise index i. The sequence is made of bases S with a single W base at position i. The dotted line shows the value of Rc in the 
absence of ssDNA fluctuation, for a difference of free energy between 5* and W bases equal to A = 0.5. A. Case At = {t)/10. 
The decay constant Rc{i) for the Bayesian error (|92p grows as y/i (dashed line). B. Case At — > 0. The full line shows Rc{i) 
for the Viterbi procedure without deconvolution; for i >7 Rc{i) is infinite, meaning that the W base is almost surely predicted 
to be of S type. With appropriate deconvolution the dotted line value for Rc is recovered. 



where D is an appropriate deconvolution kernel to be specified later. Ideally, after deconvolution, the probability of 
T'^ given the sequence B should coincide with the local probability (j86p . The prediction for the sequence is then done 
through the maximization of V ([5^ over B, given the set T'^ of deconvoluted times. 

We start by showing how the performances of the inference procedure are dramatically worsened by fluctuations if 
no deconvolution is performed {D = Id), and then show how the effects of fluctuations are cured when deconvolution 
is performed. We focus here on the cases R — 1 and 1 only, and concentrate on the case At ^ first. Consider 
the base at location i, which wc suppose to be, say, of type W . The error in predicting this base reads, see Appendix 

El 



w 



E n ( 1 - §7 ) ' (94) 



where 



= exp(go(fe,) - go{bj)) {D A),^ , (95) 

and T^,T'^ are defined in (|23p . The subscript 1 refers to the value i? = 1 of the number of unzippings. Figure [771 
shows 1 — ej^j as a function of y/i for a repeated sequence SSSS . . ., and for an alternate sequence SWSW ... in the 
absence of deconvolution {D = Id) . The error increases from a value for i = 1 essentially equal to its counterpart 
([M)) in the absence of strand fiuctuation, to reach unity at large i. This behavior is easily interpreted: in the absence 
of deconvolution the apparent time t" (more precisely, the reduced time rf ((20l) ) on base i is the sum of the real 
times spent on each base j, weighted with the probability Cij ([95]) . As i grows more and more bases j contribute 
to the sum with smaller and smaller weights, with a number of contributing terms scaling as Vi. The law of large 
numbers tells us that the distribution of r° is asymptotically concentrated around &. single va,lue, equa-l to — c 
and to = i(l + e^) for the SSSWSSS.. . (where the unique W base is located at position i) and SWSW . . . 
sequences respectively. As these values exceed (|23p the base is almost never correctly predicted 45] . The very 
tiny probability of success is due to the tail of the times below , which decreases exponentially with (Fig [27|) . 
In the limit of a large number R of unzippings the error decreases as (Appendix [H| 



if E^C,, <l + # 



ei^.^ ~ e-^/^=« where Rc{i) = { (i+4-E, c..)' ^ . (96) 

>1 + T 



+00 if > 1 ' 



The above expression was derived when the free energy difference A between W and S bases is small, the hardest case 
from the inference point of view. In the absence of fluctuation A = D = Id we find back result PT|) as expected. Notice 
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FIG. 27: Probability that a base is correctly predicted, 1 — , as a function of its location i in the case of: a repeated sequence 
of 5* bases with a single W base at position i (black dots), an alternate sequence SW SW . . . (empty dots). In both cases the 
rate of success decreases exponentially with the square root of i. The difference of free energies between S and W bases is 
A = 2.8. 



Rc = oo simply means that the error does not converge to zero when R increases. An illustration of this situation is 
given in Fig[^5K. The number Rc{i) of unzippings necessary to correctly predict a unique W base located at position 
i inside a repeated SSSS . . . sequence increases with i, and diverges for i > 7 in the absence of deconvolution. The 
reason for this failure is the same as in the above R = 1 case: the apparent time on base i is corrupted by too many 
S bases and the true nature of the base cannot be recognized. 

Fortunately the situation drastically improves when the signal is deconvoluted with the kernel 



equal to the pseudo-inverse of matrix A. We have not encountered any numerical problem to calculate this pseudo- 
inverse from the inverse of A for sequences with a few hundred bases. The matrix C in (j95p then reduces to the 
identity matrix, and the errors for a single (j94p and a large number (|96p of unzippings decrease to their respective 
values in the absence of fluctuations. In particular the number of unzippings necessary to correctly predict a base is 
simply Rc ^ independently of i. As a conclusion, through an adequate and sequence-independent deconvolution 

procedure, we have been able to completely remove the effect of ssDNA fluctuations. 

In the case of a finite delay At we expect that an appropriate deconvolution with the kernel (j97p is sufficient to 
correctly infer the sequence with the extended Viterbi algorithm of Section IV Al [A2»] . 



In this paper we have studied the inference of a DNA sequence from Monte-Carlo generated unzipping signals. 
Inference is made uneasy by the fact that unzipping signals are largely affected by thermal noise, due to the fact that 
the free energy to open a base pair (the loss in binding free energy plus the work to stretch the unpaired DNA strands) 
are of the order of k^T. The main goal of the present work was precisely to reach a theoretical understanding of how 
to cope with thermal noise in the inference process. 

The present study is in part numerical and in part analytical. From the numerical side we have first generated, 
from a given sequence, unzipping data by a Monte Carlo algorithm based on a previously introduced dynamical model 
of the unzipping [28| . We have then implemented algorithms to reconstruct the most probable sequence from the 
unzipping signal. The prediction error on each base can be simply evaluated through the comparison between the 
true and the predicted sequences. From a theoretical side we have calculated the error (probability of misprediction) 
with the aim to understand its dependence on the sequence, the intrinsic parameters i.e. the biochemical base pair 
free energies, and the extrinsic parameters i.e. the unzipping force, the number of repetitions of the unzipping, the 
collection of unzippings from both sides of the molecule, .... Numerical results compare very well with analytical 



D = A'< 



(97) 



VI. SUMMARY AND CONCLUSION 
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calculations. Our main analytical finding is that the average prediction error on a base i decreases exponentially with 
the number R of unzippings. The decay constant Rc{i) gives the number of unzippings required to achieve an excellent 
prediction of the base. We have analytically calculated the value of Rc in the following cases: (high force) repeated 
sequences without pOI31l) and with ([55)1 stacking interactions, heterogeneous sequences ([M]): (moderate force) with 
(|70l7ip and without stacking interactions (|76p . for two-way unzippings (|78p . and taking into account the fluctuations 
of the extension of the unzipped strands (|92|96|) . 

We have first considered the ideal case in which it is possible to follow directly the dynamics of the opening fork 
with a perfect temporal resolution; in this limit all base pair opening and closing events are detected. The only 
source for stochasticity is the thermal motion of the fork. In the absence of stacking interaction the decay constant 
Rc{f, i) for the base i and at a force / can be obtained, in this case, as the ratio of the decay constant at large force, 
Rc{f — oo,i), over the average number of openings of the base during a single unzipping, (uj). The average number 
of openings of a base, (wi), depends on the free energy landscape of the molecule, determined by the force and the 
sequence content, and was computed in Appendix iFl In the presence of stacking interactions Rc{f,i) depends on the 
whole sequence and was calculated through an asymptotic version of the Viterbi algorithm (Section IIVD 3p . Base 
pairs exhibit a lock- in phenomenon : there exist blocks of neighbouring bases with the same decay constant Rc{f,i), 
while bases in different blocks have much weaker correlations. We also show that much better predictions on the value 
of a base can be obtained from the same amount of collected data if the molecule is unzipped from both extremities 
rather than from one extremity (as done so far). 

The assumption of infinite temporal bandwidth and precise knowledge of the fork position dynamics allows us to 
start from the simplest case for the sequence prediction analysis. The advantage is that Bayesian inference can be done 
exactly with a fast procedure, the so-called Viterbi algorithm. The most likely sequence, given a measured unzipping 
signal, is found in a time scaling linearly with the number of the bases. The existence of a fast, exact algorithm 
allowed us to check analytical results; the latter are indeed always obtained for the optimal sequence, irrespectively 
of the existence of a practical algorithm capable of finding this sequence. 

In the second part of the paper we have made a step forward toward the analysis of real experimental data and have 
included in the inference analysis two major sources of instrumental limitations: the finite data acquisition bandwidth, 
and the elastic fluctuations of the unzipped DNA strands. 

The finite resolution in time is such that during the time interval between two data acquisitions the opening fork can 
move by (much) more than one base. The exact Viterbi algorithm has been generalized to the case of a large but finite 
bandwidth, by considering all the forward and backward transitions of the opening fork which can take place, within 
a range J, during the time interval At between two measures. This new algorithm is able to reconstruct the sequence 
when the range J is of the order of the ratio between At and the typical sojourn time (t) on a base pair. Though 
our extended Viterbi algorithms still runs in a time growing linearly with the number of bases, it is exponential in 
the range J, and is limited in practice to J < 10. This algorithm is thus implcmcntable for At ~ 10 (t), i.e. up to 
about 10 fj,s. In other word the bandwidth frequency should be larger than 100 KHz, a larger value than the current 
value for the bandwidth in real experiments of the order of 1-10 KHz. Other algorithms presumably not guaranteed 
to reach the most likely sequence, but with a running time polynomial in the range J, should be implemented. 

In addition we have considered the effects of the fluctuations in the extension of the DNA strands. Indeed, even 
if the distance between the extremities of the unzipped strands is typically known within < 1 nm accuracy [j, [lo| , 
thermal fluctuations in the strand length (and possibly in the linkers) are responsible for a larger uncertainty over 
the position of the opening fork. We have, in particular, extended our theoretical formalism to calculate the decay 
constant of the error with the number of unzippings Rc at high force, without stacking, in presence of DNA strand 
fluctuations and with an interval At between two measures finite but small with respect to the sojourn time (i). We 
have obtained that the decay constant Rc for the error on base i is multiplied by ^/i with respect to its counterpart in 
the absence of DNA fluctuations. The further from the beginning of the sequence a base is, the larger is the number 
of unzipping to reach a good prediction. 

The theoretical formalism for At ^ suggests a way to preprocess the signal by deconvoluting it with the pseudo- 
inverse of the (sequence-independent) DNA fluctuation matrix ([89|) . This signal can then be processed with the usual 
Viterbi algorithm, and the quality of the prediction is the same as in the absence of strand fluctuations. A natural 
question is whether the same deconvolution procedure could be applied to the realistic case of a flnite bandwidth 
or not. We are currently working on this problem, and are developing a formalism for the calculation of Rc in the 
presence of DNA strand fluctuations and for experimental value of Ai ~ 0.1 ms |42]. The design of efficient inference 
algorithms in this realistic case is a challenging issue. 

An implicit but not well justified assumption we have so far is to have a perfect knowledge of the pairing free 
energies and dynamics of unzipping i.e. of the conditional probability P{T\B). In practice, however, modeling cannot 
be perfect and any functional form for P(T\B) will be only approximate for a given experimental setup. Numerical 
investigations show, not surprisingly, that the quality of prediction deteriorate when the rates used by the Viterbi 
procedure differ too much from their values in the data generating Monte Carlo procedure. A possible way out should 
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be based on a learning principle: in a first stage unzipping data corresponding to a known sequence (A-phage) are 
collected to caliber the rates, in a second stage predictions are made for new sequences. Last of all we have here 
considered unzipping at constant force. Investigation of the constant velocity case would be very interesting. 
Local minima are well predicted and remarkably the force signal may be affected by the substitution of one base pair 

Let us finally mention a related albeit more complex problem, the analysis of RNA unzipping data. The non 
complementarity of single strands in RNA molecules give rise to complex folded secondary structures with multiple 
helices. Gerland and collaborators have suggested a way to reconstruct RNA secondary structure by combining the 
recording of the force-extension curve and the passage through a nanopore [29|. The passage through the nanopore 
would indeed force to the helices to open one after the other with a sequence-specific order. In this respect, thanks to 
the nanopore geometry, the RNA unzipping problem is reduced to a unidimensional problem for which the inference 
methods presented here could be of interest. 
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[43] A fit of tfie slope of the curve in figure [TOl gives R — 100 ± 1 while a better fit i? = 113 ± 2 is obtained by talcing into 

account the multiplicative 1/y/R term in the error in formula (|30|l . 
[44] To define properly the change in the free energy G lU of the molecule when its last base i = A'' is opened we have added 

a fictitious i = N + 1 base; the contribution to the free energy is symbolized by Ag — go{bM, &jv+i). In practice Ag is not 

given by Table |I] but may have a more complicated origin. For instance the molecule may end with a loop, Ag will then 

be equal to the gain in entropy when the loop opens. 
[45] The same argument indicate that the probability to mispredict base bi = W base among a repeated WWWW . . . sequence 

vanishes when i tends to infinity. The reason is that the apparent time on base i converges to the average time on the 

neighbors which are all of the right type W. 

APPENDIX A: IMPLEMENTATION OF THE EXTENDED VITERBI ALGORITHM 

A time trace T of the unzipping signal, produced by the Monte Carlo procedure, is first encoded in a vector 
/C — {kl '^\k\ ''^'^^ . . .kf'\k\^^\ . . .k!f^} where kf^ is the number of transitions i ^ i + j . J fixes a cutoff on 
the displacement taken into account: only jumps by jjj < J bases are considered. The information on the opening 
dynamics i.e. the vector /C, the applied force / and the temporal resolution At is used to construct the transfer matrix 
M^-^'') §^ for the i*** base. 

The matrix exponentiation, needed to compute U ((80|) . is carried out by solving the set of 2J + 1 coupled differential 
equations 

j' 

where j — — J, . . . , J, and is defined in ([STjl . The initial conditions read 

2/° = 1 

2/° - j^i. (A2) 

The value of yj at time At is the matrix element Ui^'j \ of the truncated evolution operator. The operation is repeated 
for the various values of the starting base index i to obtain the whole operator. From a numerical point of view we 
solve (jAll) using a classical 4*'* order Runge-Kutta method for integration of ordinary differential equations. 

Once the matrix C/'^'''*-' is computed, the transfer matrix M'"^'*) can be easily evaluated knowing the unzipping 
dynamics i.e. the vector /C. The probability of a sequence B given the unzipping signal T is then maximized via a 
transfer-matrix-like algorithm. To avoid errors due to small numbers we apply the recursive procedure (|84p to the 
logarithm of the probability instead of the probability itself. The general optimization step is therefore 

\uP^^^l{b^+i,h+2, = max [ln/^('^^(&„ . . . , 6,+j_i) + lnM('^'')(&„ . . . , h+j)] 

At each step, the type of the i*'' base that maximizes lnPj'''^'*\ &*, is stored for each of the 4"^ possible choices of 
following J bases (&i+i,6i+2, ■ • ■ ,bi+j)- ^'^ possible sequences are thus constructed and kept in memory. When the 
algorithm reaches the end of the sequence the maximization over the last base type selects the best sequence and all 
previous bases can be simply reconstructed from the 6*, going backwards from the last base to the first one. 

Some problems in memory allocation and state labeling must be faced. The dimension of each vector 
6*(&i+i, 6i+2, • • ■ , grows as 4'' and there are N (up to 48,502 for a A-phage DNA) different vectors. When 

the range J is large, the memory space required to store this information becomes huge. To circumvent this problem 
we have reduced the space complexity of the algorithm by increasing its time complexity. To do so we apply the 
algorithm more times, memorizing, and reconstructing, only a portion of length D of the sequence during each exe- 
cution. During the first execution only the last D bases of the sequence are reconstructed. In the second execution 
the algorithm stops at base N — D, where N is the total number of open base pairs, and another set of D bases 
are predicted. This procedure goes on until the first base of the molecule is reached. D is of course an adjustable 
parameter and the number of times the algorithm is repeated is chosen consequently. 

Our code is written in a range independent way. The user simply sets J at the beginning of the program, without 
changing anything else. The 4'^^"'+^) choices of the variables (&i_,/, . . . h+i, 6i+2, ■ ■ ■ , h+j) that define a specific recon- 
struction 'state' are represented by a bit string whose length depends on the fixed range J. The string is assigned 
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i + J 


l + J-1 




i-J + 1 


I- J 


s 


Sequence 


00 


00 




00 


00 





AA . . . AA 


00 


00 




00 


01 


1 


AA . . . AT 


00 


00 




00 


10 


2 


AA . . . AC 


00 


00 




00 


11 


3 


AA . . . AG 


00 


00 




01 


00 


4 


AA . . . TA 


00 


00 




01 


01 


5 


AA . . . TT 


00 


00 




01 


10 


6 


AA . . . TC 

















TABLE IV: Table of variable labeling for a set of (2 J + 1) bases. Each sequence is identified by a label s in its binary writing: 
2 bits identify the type assigned to each base, the lower bit being corresponding to the base with the lower index along the 
sequence. 



in the following way: 2 bits identify the type selected for a base, the lower bits referring to the base with the lower 
index along the chain, see Table [X] The binary number s encoding a string of 2 J + 1 bases is called its label. 

The largest range we could test is J ~ 10. Like the memory cost, the execution time of the program scales linearly 
with N but exponentially with the range J. The time needed to perform a single unzipping (without considering 
the statistics over samples) increases as uj^k x 4'' x (2J + 1)^, where uj^k is the number of integration steps in the 
Runge-Kutta subroutine. 

APPENDIX B: CONVOLUTION PRODUCTS FOR R UNZIPPINGS 
1. Distribution of the sojourn time 

The distribution of the total sojourn time r (|26p spent on a base for R unzippings is defined as 

/"OO />oc />oo / \ 

PR{r)=j drWFi(rW)y dr^'^ P^ir^'^) . . . J dr(«)Fi(r(«)) <5(^r - (rf + + . . . + ') j (Bl) 
where Pi is defined in eqn (j21[) . Taking the Laplace transform, we obtain 

dTPn{T)e-'^ =U^ drPi(T)e-^M =(1 + 5)"^. 



(B2) 



It is a simple check that this expression coincides with the Laplace transform of the right hand side of eqn (|27p , hence 
proving identity ([77]) for Pr. 



2. Distribution of the number of fictitious unzippings 

To calculate the i?*'* power (for the convolution product) of pi ([66]) we introduce the generating function 



l-{l~E,)x 



(B3) 



37 



Thus pr{u) is the coefficient of in the above rightmost expression. It is convenient to define 

K.) = gp.(.).-^ = ( ^_^^;_^J (B4) 

We then obtain expression ([68|l from the identity 



1 d'^-^g 



{u-Ry. dx"^-^ 



(B5) 

x=0 



APPENDIX C: STATIONARY DISTRIBUTION OF LOGLIKELIHOOD FIELDS 

Assume that the sequence is repeated; hence we can drop the base index i in the definition of function Fi (|43p and 
in the distribution Qi of the loghkehhood. We rewrite eqn under the form 



Q{h') ^ / dhTR{h',h)Q{h) , (CI) 

J —OO 

where the kernel Tr is defined through 

rOO 

TR{h', h)= dT Pr{t) 5{h' - F{h, t)) . (C2) 
Jo 



X [e^ — 1) X [1 — e ^ ) 



In addition we define 



where we have used definition (j48p for parameter x. We now rewrite 

F{h,T)^-h + x{e'^"'' -1) max(Ti(/i) - -,0) + a;(l - e"'^'') min(r2(/i) - -, 0) (C4) 

R R 

The value of above function of r depends on the relative values of ti and T2. Let us make the hypothesis {HI) : 
e^^ + e~^^ > 2. Then, ti < T2 if and only if h > ho with 

AV^(l-e-A-)-A^(eA--l) 



Assume in addition that {H2) : A^ < A-^. Then 



We obtain from (|C4jl . 



A^-j^x{e^ -1) if /i>-A^andT<ri(/i) 
-F(/i,t) = <( -/i if /i > -A^ and Ti(/i) < r < r2(/i) (C7) 

A^^- ^2;(l-e-^^) if T>T2(/t) 

and the following expression for the kernel Tr (IC2[) . 

-Pfl,(-i?Ti(-/i'))/(^^)/(e^"' - 1) if h> -mm{h\A^) 

S{h' + h)x[j{R,Rmax{0,Ti{h)))~-f{R,RT2{h))] if /i > -A^^ 

' I Pr{~ RT2{-h'))/{Rx)/{l-e-'^') if h' <min(~h,A^) ^ ' 

if h' > A^ or A^ < -h <h' < A^ 
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where 7 is the incomplete Gamma function ([29]) and distribution is defined in (|27p . We then inject expression 
(jCSp for in the fixed point eqn (|Cip . and integrate both sides over /i' over the interval H < h' < 00. As a result 
we obtain the remarkably simple identity 

Q{H) = A(H) - B{H) Q{-H) (C9) 

where the cumulative distribution Q is defined in (|45p. and functions B in (|47p . 

From jCH]) (fourth line) (3(/i') vanishes when h' > . Hence Q{H) ^ for H > (third hue of dH])). Choose 
now < then Q(-i/) = and, from ^Q^i, Q{H) = A{H) (first line of (gg)). Then we iterate to obtain 

Q{H) = A{H) - B{H) [A{-H) - B{-H) Q(H)\ , (CIO) 

from which we extract the expression of Q{H) in the range — A'^ < H < A^ (second line of (|46p V It is easy to check 
that Q is a continuous function of its argument both in — A'^ and +A"^. Notice that hypothesis (H1,H2) hold for 
typical values of the binding free-energies. 

It is quite remarkable that an exact analytical expression for Q{h) is available for our model. Indeed the recurrence 
equation for the field distribution of most disordered one-dimensional cannot be solved in a closed form . Dyson 
noticed in his original paper [i^ that a case of solvable model is obtained when the site disorder (here, the time ti 
spent on each base) is exponentially distributed. The present study generalizes this observation to the case of the 
convolution of exponentials. 



APPENDIX D: CALCULATION OF THE ERROR e AND THE CORRELATION FUNCTION x'''" 

Assume the sequence is very long {N 3> 1), and consider the base at location i far away from the extremities 
(1 ^ j ^ N). Base i can be predicted to he b (= W or S), with probability 

i'/(6.) = exp(-i?4(6)) (Dl) 

depending on the stochastic set of times {ti} spent on the bases. We look for the distribution of the loglikelihoods of 
base i, 

Qt(/jt) ^ Probability [/i^ = irjiS) - irjiW)] (D2) 

where the probability is calculated over the sets of sojourn times {ti}. Notice that we do not expect to vary with 
j in the bulk of the repeated sequence (see calculation of the correlation function below). 

does not coincide with the distribution Q of fields used in the iteration equation (g^ . Indeed the latter merely 
expresses the dependence of the loglikelihood over base i + 1 upon the choice for base i, independently of what happens 
at site i + 2. In other words, eqn ([9]) is a propagation equation for the left-to-right likelihoods tt"*; the subscript 
has been omitted so far to lighten notations. The direction of propagation is arbitrary: it corresponds to the choice 
of running the Viterbi algorithm from the first to the last base, determining the value of this last base, and then 
deducing the values of all bases from the last one to the first one. Clearly, we could have decided to run the Viterbi 
procedure in the opposite direction. The recurrence equation for the right-to-left likelihoods n^f is straightforwardly 
established, and reads 

nl{h,)=mm{7r;i,ib,+i)-goih,h+i)+reS°^''-'^^'hjR) . (D3) 

bi + i 

When the binding energy matrix is symmetric, the above recursion can be rewritten as 

^Tr{b^+l)=mm{^T;+,{b,)~go{b,,b,+l)+reS°(>'^^b^+'hJR) , (D4) 

and is identical to the recurrence equation ^ for tt^. We deduce that the stationary probability distribution of 
right-to-left fields, = 7r^(S') — Tr^{W), is equal to the left-to-right field distribution Q. 

Obviously, the actual prediction for base i is the base bi maximizing (|Dip and depend on the bases located on 
both left and right sides, that is, on left-to-right and right-to-left likelihoods, 

P^{b,)^Pr(b^)xPr(b,) I.e. T:\{b,)^^rih)+<{b{) , (D5) 



39 



when taking the logarithm. Translating the above equation in terms of fields we obtain 

h\ = hr + hr . (D6) 

A symbolic representation of the above equality is proposed in Fig. The distribution of 'true' likelihoods is thus 
given by 

Qt(/jt) ^ j dh^Q{h^) J dh^Q{h^) (5(/it -h-^ - h^) . (D7) 
The error in predicting base i is therefore, 

dh^Q\h'^) = l- dhQ{h)Q{h) , - / dh'^Q\h^)= dhQ{h)Q{h) (D8) 

for repeated sequences of 14^ or S" bases respectively, see formulae (150151152^ . We have here used definition (|^5|) for 
the cumulative distribution Q of fields. 

A similar approach can be used to calculate the disconnected nearest neighbor correlation function x''*" (|49p . Assume 
for simplicity that the true sequence is a repeated sequence of S bases, and consider the two bases at locations i and 
i + 1. Call and h^-^ the left-to-right and right-to- left likelihoods incoming onto bases i and i + 1 respectively, see 
Fig. E5B. Let = 1 if base i is (correctly) predicted to be S, if the prediction is W. We define a similar variable, 
rii-i-i, attached to site i -I- 1. Finally call r the normalized sojourn time on base i with distribution (j27p . Given a pair 
of incoming likelihoods {hj* , h^^^) and the sojourn time r, the Bayesian prediction for {ni,ni^i) is 

{fii {h-^ , , r) , ni+iih-^ , /i^^^ , r)) = argmax (n, n' , /i^" ,hl^^, r) (D9) 

(n,n') 



where 
and 



S^^(n, n\ h, h', T)^h{l-n) + h'{l~ n') + f^{n, n') ~ - exp g'^'^{n, n') (DIO) 



f^in, n') = go{S, S){n n' - 1) + go{W, W) (1 - n) {I - n') + go{W, S) [n (1 - n') + n' (1 - n)] 

= n n' (A*^ - A^) + {n + n') A'^ + A'^ + A^ . (Dll) 

The correlation function between ni,ni^i is 



(n, n^+i) = y drPuir) j dh~' Q{h:[*) dh^^^ Qih^^) S„i{h^ Mi-^^,r)s ^n.+iCV.'iI^i,^),! (^^2) 

where Ja^f, = 1 if a = 6, otherwise. An inspection of (jPlOp shows that both bases are correctly predicted to S when 
h~* and h^^i are both smaller than —A'^ -I- J(e^^ — 1). Hence formula (jSTI) . Formulae ((50| and (|53p for repeated WW 
and SW sequences can be obtained along the same lines through substitution of (jPlOp and (jPlip with, respectively, 

E^^{n,n',h,h',T) = hn + h'n' + g^^{n,n')- -exp g^^{n,n') (D13) 

R 

= nn'(A'^-A^) + (n + n')A^-A'^-A^ , 

and 

S^^(n,n',/i,/i',r) = h {1 - n) + h' n' + f^{n,n') -^eyip f^{n,n') (D14) 
f^{n,n') = nn' {A^ - A^)-nA^ + n' . 



APPENDIX E: LARGE R ASYMPTOTIC 

A saddle-point calculation of the incomplete Gamma function gives the following large R asymptotic for z 7^ 1, 

exp I" — — 1 — In 2;)] 

jiR,Rz)^9il-z)+ — ^ (El) 

VznR [z — 1) 
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i+1 



B 



i+i 



FIG. 28: Symbolic representation of the calculation of the distribution of likelihood at site i (A), and the joint distribution 
at sites i and i + 1 (B). The left part of the sequence induces a left-to-right likelihood h~' on base i, while the right part 
contribution is (A) or (B). 



where 9 is the Heaviside function: 9{1 — z) = 1 if z < 1, if z > 1. AppUcation of this formula to the error ([28|) in 
the no-stacking case yields the large R scaling of e in (fSO]) . 

Consider now the case of stacking interactions between neighboring bases. We first calculate the cumulative distri- 
bution Q (|46|) of likelihoods in the R ^ oo limit, then derive finite R corrections. With definitions (|47l48p we obtain, 
in the infinite R limit. 



A{h) eih" - h) , B{h) e{h - h'^) - e{h - k^) (E2) 



where 



/i^ = A^-a;(e^ -1), = A' - x {l - e'^ ) . (E3) 



For repeated sequences of, respectively, bases W and S, we have x = e ^ and x = . It is a simple check 
that, whatever the value of b, and have the same sign (positive for the WW sequence, negative for the SS 
sequence). Thus the product B{h) B(—h) in (j46| vanishes. We find that the cumulative distribution Q{h) of fields is 
a step function. More precisely, 

Q{h) ^ 6{h - hl^) where = - 1 + er^"^ , = -A^ + 1 + e'^', (E4) 

from which we deduce that the error in predicting a base vanishes in the large R limit. The case of the alternate 
sequence SW is more complicated. Setting x ~ 1 in (jE3|) we have < and > 0. Using (|E2|) and (|46l) we 
merely obtain Q{h) ^ 1 for h < , Q{h) = for h > -h^' and 

Q{h) = l-Q{-h) {h^ <h<-h^) . (E5) 

Though (|E5p is not sufficient to characterize the likelihood distribution it allows us to calculate the error from (I52p . 
with the resuh ((FT)) . 

Let us now calculate the corrections to the infinite R limit. The calculation of the error e is similar for WW and 
SS sequences, and is thus reproduced below in the WW case only. Let us introduce 

"W- ^(f!,'t) ' /^W-™a-(0'^^!^) i^-e-^"^). (E6) 

Using the large R expansion (jEip for the functions A and B in (|47|l we obtain from (|46p the asymptotic expression 
for the cumulative distribution of loglikclihoods 

and, through differentiation with respect to h, 

Qih) = \[^ ^ exp [ - R{p{h) - 1 - Inm)] ■ (E8) 
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These expressions hold when P{h) < a{h). This condition happens to be fulfilled for the choice of parameters of 
Section flVBI and in the vicinity of /i = 0. From ([50)1 we have 



[ dhQ{h)[l-Qi-h)] 
exp[-i? (/3(0) - 1 -ln/3(0))] 



2n{l 



1 



exp 



i? In 1 - 



h 



(E9) 



The dominant contribution to the integral comes from the /i ~ region. Expanding the integrand to the second order 
in h and calculating the Gaussian integral we obtain expression ([M]) . 

Finally we consider the case of finite temperature prediction of Section (jIV A2p for a base h {S or W), in the absence 
of stacking. Let A be the difference of free-energy between the two base types, and r given by ([57)1 . Integrating ([55)1 
by part and performing the change of variable r = R{x + A)/{e'^ — 1), we obtain the following expression for the 
error, 



eniT = 1) 



dx 



DC 

dx 



R{x + A^) \ ' 
e^'-l )_ 
exp[-i?G(x)] 



(1 



1-{AS + x)/{e^' -1) 
where we have made use of (jEip to obtain (jElip from (jElOp . and have defined 



,AS 



1 



- 1 - In 



.AS 



1 



(ElO) 
(Ell) 

(E12) 



The maximal contribution to the integral comes from the a; = region, with G(0) = t — 1 — Inr. Defining x = Rx 
and expanding G around a; = to the first order, we obtain 



efl(T = 1) = 



^-_R(T-l-lnT) 



V2t^ (1 - r) 



dx 



(r-l-lm 



(1 



-x\2 



V2t:R (1 - r) sin(7rCT) 



(E13) 



where tr = |G'(0)| is given by (|57)l. 



APPENDIX F: CALCULATION OF THE ESCAPE PROBABILITY Ei 

In this appendix we calculate the escape probability Ei that the fork moves away from base pair i (never reaches 

it back) after its first visit. Assume the fork starts its motion from base j. We define pj*' as the probability that the 
fork will never reach position i at any future instant. This probability is larger than zero when i < j since the walk 
is transient. Given the bp index i the probabilities pj s fulfill the recursion relation 

+ (Fl) 
where, in analogy with definition (|65p for a repeated sequence, 

e9Af) 

* ^ e9s{f) + e9oi>>j,b,+i) (^2) 



is the probability that the next base visited by the fork in j is j — 1. Equation (jFip is complemented by the boundary 
p[^^ = and p^^ = 1. Mathematically speaking the probability of not reaching i from N is not equal to unity since 
a random walk on a finite sequence is recurrent. However this approximation is quantitatively excellent for the long 
sequences considered here. Defining 

p)+i 
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1-Q. 



A 



B 



^' Q, 



c 



FIG. 29: Patterns A, B, and C present in the transition trace around base pair i. See text for definition. 



we obtain the Riccati recursion relation 



Ef^O- ^^1-7^-^^ for,>^. (F4) 

We have solved equation (|F4p numerically for the A-phage sequence. The escape probability from i is then obtained 
from lEll) and (HZ]), 

= ^ = n ^ • (F5) 

Pi+1 3>i+l 

APPENDIX G: AVERAGE ERROR e AT FINITE FORCE 
1. Case of one-way unzippings 

In this appendix we calculate the average prediction error after one unzipping over the distribution of the unzipping 
time traces. For a given time-trace, the prediction error depends only on the observed set {ti, m, di} of times ti spent 
on each base, and numbers of opening (ui) or closing (di) of each base. To make the average we have to calculate 
the distribution Pi{{ti,Ui,di}) of such sets on all the time traces. Pi{{ti,Ui,di}) is therefore the product of the 
probability to observe a set of {t^, Ui^di} in a given time trace (given in equation [5]) time the multiplicity of such a 
set {ti,Ui,di} on all the possible time traces. 

Let us start by calculate the distribution Pi{ui, di) ignoring for a while the time ti spent on this base. Let us focus 
on base i; the sequence of opening and closing transitions around this base, hereafter referred to as transition trace, 
can be decomposed into three kinds of elementary patterns schematized in Fig[29l and labeled with letters A, B and 
C: 

• Pattern A (Fig I29I \-) corresponds to staying on base i for some time, moving forward {i i + 1, probability 
1 — Qi), then coming back to i after a random walk throughout the upper part of the sequence (i + 1 . . . N) with 
probability 1 — Ei. The probability of pattern A is thus Pa — {1 — qi) x {1 — Ei) . 

• Pattern B (Fig I29B ) corresponds to staying on base i for some time, moving backward (i — > i — 1, probability 
Qi), then coming back to i after a random walk throughout the lower part of the sequence (j + 1 . . . N) with 
probability 1. The probability of pattern A is thus Pb = qi- 

• Finally, pattern C (Fig [^IP) corresponds to staying on base i for some time, moving forward (i i + 1, 
probability 1 — qi), without ever coming back to this base later on (probability Ei). This final pattern has 
probability Pc = {1 - qt) x Ei. 
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The number of closing transitions in a transition trace, di, is simply equal to the number of B patterns around base 
i. Similarly, the number of opening transitions, Ui, is the sum of the numbers Nj^ and Nq of A and C patterns 
respectively. As Nc = 1 by definition, we have Na = Ui — 1. A and B patterns can be randomly located in the 
transition trace and are followed by one C pattern, the distribution Pi{ui,di) on the ensemble of transition trace is 
therefore: 



Ui - 1 + rf, 



Pi{u,,d,) = (^^' J ' ^'j [(1 - q,) (1 - E,)] E, (1 - q,) . (u, > 1, > 0) . (Gl) 

Let us now focus on the total time ti spent on base i. It is the sum of Ui + di times each exponentially distributed 
with average sojourn time 

^ / n (hL fi-f- \ 7777 (^■2) 

Thus, Ti — ti/ {ti) is a stochastic variable obeying distribution Pji (P?)) where R = Ui + di plays the role of a fictitious 
number of unzippings. We obtain the joint probability of time Ti, opening and closing moves Ui and di, 

P, (T- u d ) - g/' [(l-g.)(l-i?0]"'-^i?.(l-gOe--T.'''+"'-^ 
Pi{T„u.,d,)- d,!(u,-l)! ■ 

Summation over all values for di lead to the (single base) probability for unzipping data 

Pi in, u,) = ~ [(1 - g.) (1 - E,) n] "'-^ e--(i-'-) . (G4) 
[ui - ly. 

Neglecting stacking effects between bases, the content hi of base i is chosen to maximize the probability 



eiqy{go{bi)u,^re3°ib^){t,)' 

P{b,\n,Ui) = ^ ^ ^ , (G5) 

exp (goiW) -resoi^^ (t,) tA + exp f go(5) u, ~ r eaois) (U) tA 

where the average sojourn time is given by eqn (|G2p . This maximization can be done along the lines of Section [IV A II 
devoted to the case of infinite force. We find the average fraction of mispredicted bases at force /, 



/• — /"OO 

with definition ((G4)) for Pi. Hence eqn 



{Ti,Ui) , (G6) 



2. Case of two-way unzippings 



We now suppose that the sequence is opened in both ways, and denote by ct — + the left-to-right and cr = — the 
right-to- left openings respectively. Let uf,T^ denote the number of openings of bp i and the time spent by the fork 
on i for both directions {a — zL). The joint distribution of uf , rf is Pi (|G4p with qi, Ei replaced with, respectively, gf, 
the probability to close back bp i when the fork is in i, and Ef , the escape probability from base i in the a direction. 
ql' and E^ are simply given by (jF2p and (|F5[) respectively. In addition q~ — q'j^_i^i, and E~ can be obtained along 
the lines of Appendix |F1 

As the unzippings in both directions produce statistically uncorrelated data the joint distributions of , t^ and 
, T^ factorize. The Bayesian probability that base i is of type hi is simply given by (|G5p with Ui = uf + , 

Ti — T^ + T^ . In the framework of Maximum Likelihood Prediction we maximize this quantity to obtain the error on 

base i, 

uf,u~>l 
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where 

/• + 00 -x (m+-1) -y {u~-1) 

_ = / dxdye(x + y-r'- (u+ + u-)) 

io ^ V V . * ; ' {u^, - 1)! K" - 1)! 

and p\ is defined in (|66p (beware of the dependence of i?f on the unzipping direction a). 

The generalization to the case of i?/2 unzippings in each direction is done along the lines of Section fG 1[ with the 
immediate result 

4!^"= H /'fl./2("i^) Pfl/2K^) e|^+ , (G9) 
where pji/2 is the (i?/2)*'' convolution power of the probability pi, see eqn ((68| . 



APPENDIX H: CALCULATION OF IN PRESENCE OF DNA STRANDS FLUCTUATIONS 

Let T[ be the number of measures where the fork is really at location i — 0,1, . . . , N . These integer numbers are 
stochastic and distributed according to, given the sequence B, 



The number of times the fork is apparently at position j, T"-, given the set of , is stochastic too. Their probability 
is given by 

Proba[{T;ni{T;'-}] = E 11 1 TTTl ^[^^^^]■^'^ '^(^;"' E Z^^) [ -^(^^ E (H2) 

where 5{a,h) = 1 if a = 6, otherwise is the Kronecker function, and the fluctuation matrix A is defined in (|89p . It 
is convenient to work with the generating function of the {Tf }, 

/ 1 _ p-Atr„(6i) \ 

Gi{{y,}\B) = E ^^oh^mmn] ^^oh^inm n^^^^^ = n i_,-a..„(..)v a ev. ■ ^^^^ 

The generating function of the probability distribution of the apparent times t° = Tj* x Ai is simply G\[\yj Ai}|i3). 

The above expression for G\ holds for one unzipping. For i? unzippings the generating function Gb. is simply given 
by the i?*'' power of G\. In the large R limit we obtain 

GB.{{y, M}\B) = exp [ - i? E ^^^(l + X^{{vM (H4) 

i 

where, to the first order in At, 



Assume now that the true sequence B^ is a repeated sequence of S bases with a W base at location n; we call i?'^ 
the sequence made of 5 bases only. We furthermore assume that the free energy difference A is small which makes 
inference harder. Using p defined in ([9T|) and introducing Sj — yj/ro{S), we obtain 

Gr{{s,}\B^) = exp [R j{{s,}\B^)] (H6) 
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where 

7({s,}|S^) = -^s,/i,(B^)-^^s,/3,-fcSfc + 0(4) with ft,(B^) = l-^+AA,-„ (H7) 



and matrix (3 defined in (P^ . Notice that the expressions for h and /3 were obtained using the approximation 
fc = 1 for any fc, and in the Hmit of small p, A. The expression for ^[{sj}\B^) is obtained from (jH7p when 
A ^6. 

We obtain the large deviation expression for the distribution Pji of the apparent times through the Legendre 
transform of 7, 



PR{{t", = Tyro{S)}\B) = exp(-i? u{{t^}\B) with uj{{t^}\B) = ~ max 



(H8) 



for the two sequences B = B'^^B^. When A is small we expect the distribution of apparent times for the two 
sequences to be very close and thus the set of times {t"}* for which they are equal will be close to the most 
likely apparent times with both distribution. This justifies the second order expansion in s in (|H7|) . The exponent 
Lo* = u!\{TfY\B^) = uj{{t^Y\B^) of the probability of this crossing time {t"}* is equal, in the large R limit, to the 
inverse of Rc{n). This statement can be graphically understood from the Figure 2 in [3J|. A more detailed explanation 
will be given in The calculation of uj* is immediate from (jHSp and leads to ([5^ . For p — the value for Rc{n) 
coincide with its expression (|32p in the absence of ssDNA fluctuation. 

We now turn to the analysis of the Viterbi algorithm in the limit At = 0. The Laplace transform of the probability 
distribution P^*'' of the deconvoluted time rf = tf ro{W) on base i is obtained from Gr by applying the deconvolution 
kernel D, with the result 

where Cij is defined in ([55)1 . The error in predicting base i is then given by the integral of P^'^ over rf > since 
bf = W, see (I23l28l) . 

/■°° , ^ r°° rl-r r+i°° 

4^ "'-^^ <'•'=/ 

where 

fix, s)=(^x+ Y^^) ' ~ I'^^l + ' ^"'^) ■ (Hll) 

The result for i? = 1 unzipping is given by ([M)) . In the large R limit we obtain expression ([M]) through a saddle-point 
calculation and a small s expansion (valid for small A). The saddle-point value for x can be located in 0, or in a 
strictly positive real value. This corresponds to the two cases listed in 



APPENDIX I: ON SAMPLING AND LARGE DEVIATIONS OF THE ERROR e 

We have calculated in Section pV D 2p the average fraction of mispredicted bases within the hypothesis of exact 
sampling of the distribution of the number Ui of openings. Let us turn to the more realistic case of a finite number 
of samples, M. As M decreases, the values of Ui with exponentially small-in-i? probabilities are less and less likely 
to be sampled, leading to large deviations corrections. Let us fix on a base pair dropping the base index i to shorten 
the notation. The values of u which can be found in a sample of size Al are the ones such that 

Pr{u) X M » 1 , (II) 

where pR is given in eqn (|68p . Assume that we keep fix R and scale the number of samples according to Af ^ e^^. 
Upon introduction of the rate function for u = u/ R, 

. in) = 1 lnpR[Ru) ^il-u) In ( Azl) + u In (|^) , (12) 
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FIG. 30: A. rate function uj{u) governing the large deviations of the number u of openings of a base per unzipping, u) vanishes 
when u equals its average value, (u), and is strictly negative otherwise. B. Rc vs. logarithm of the number of samples, 
fj, = InM/R, for the 9*'' base of the A-phage sequence. 
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FIG. 31: Logarithm /ii of the number of samples (divided by R) to obtain a good estimate of Rc{i) vs base pair index i. ni 



strongly depends on the base index i e.g. we need to sample over M < 
is / = 17.4 pN. 



to accurately estimate R^ for all bases. The force 



we rewrite condition (|IT|) into 



uj{u) > -fi , 



(13) 



This condition is graphically solved in Fig [HOK. At fixed fj, a compact range of available values for u is obtained, 
centered around the average number (u) of openings of a bp per unzipping. For instance, the smallest accessible 
value, is obtained when solving condition (|I3|) as an equality (Fig. [30K). 

For each sample m — 1, . . . , A/ the measured error takes value u = (if the base is correctly predicted) and 1 
otherwise, with probabilities 



Pv = 



1 



We evaluate this probability through a saddle-point approximation 



where 



i?c(/i) 



max 



Rc 



U~ RcLU (u) 



(14) 

(15) 
(16) 
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Let us call /ig = ^(wo) where uq is the root of lj'{uo) = and /ii = J^^^^^ = /.to + > /iQ. As a; depends on the 
bp i so do fiQ,fii. Then, 

• when ^ < fiQ the maximum on the r.h.s. of (|T6|) is reached in fulfilling the equality p3|l. and is an increasing 
function of (Fig [30^). 

• when fJ-Q < fi < fJ-i Rcip-) — -Rc(mo) = Rc does not depend on fi anymore (Fig I30B ). The average number of 
erroneous samples reads 

and is exponentially small in R by the very definition /xi, Hence no erroneous sample is detected and no estimate 
of Rc can be made. 

• when ^ > Merr is exponentially large (jl7|) . and the decay constant of the error can be safely measured and 
estimated to be Rc{fio)- 

Figure [nH shows ni, the logarithm (divided by R) of the number of samples needed to accurately estimate Rc, as a 
function of the base index i. We observe that /ii varies a lot from base to base. 



